Commit Graph

35195 Commits

Author SHA1 Message Date
Theodore Ts'o 1c4af2535c ext4: use i_size_read in ext4_unaligned_aio()
commit 6e6358fc3c upstream.

We haven't taken i_mutex yet, so we need to use i_size_read().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:37 -07:00
Theodore Ts'o de65d2a222 ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent()
commit 622cad1325 upstream.

The function ext4_update_i_disksize() is used in only one place, in
the function mpage_map_and_submit_extent().  Move its code to simplify
the code paths, and also move the call to ext4_mark_inode_dirty() into
the i_data_sem's critical region, to be consistent with all of the
other places where we update i_disksize.  That way, we also keep the
raw_inode's i_disksize protected, to avoid the following race:

      CPU #1                                 CPU #2

   down_write(&i_data_sem)
   Modify i_disk_size
   up_write(&i_data_sem)
                                        down_write(&i_data_sem)
                                        Modify i_disk_size
                                        Copy i_disk_size to on-disk inode
                                        up_write(&i_data_sem)
   Copy i_disk_size to on-disk inode

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:37 -07:00
Jan Kara 58b9dc7b5d ext4: fix jbd2 warning under heavy xattr load
commit ec4cb1aa2b upstream.

When heavily exercising xattr code the assertion that
jbd2_journal_dirty_metadata() shouldn't return error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that buffer_head passed into
jbd2_journal_dirty_metadata() didn't have journal_head attached. This is
caused by the following race of two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode as that
function can call sync_dirty_buffer() which would deadlock. Luckily in
nojournal mode the race is harmless (we only dirty already freed buffer)
and thus for nojournal mode we leave the dirtying outside of the buffer
lock.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:37 -07:00
Matthew Wilcox 6309a184a6 ext4: note the error in ext4_end_bio()
commit 9503c67c93 upstream.

ext4_end_bio() currently throws away the error that it receives.  Chances
are this is part of a spate of errors, one of which will end up getting
the error returned to userspace somehow, but we shouldn't take that risk.
Also print out the errno to aid in debug.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:37 -07:00
Kazuya Mio 549b3cf1c4 ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKS
commit 4adb6ab3e0 upstream.

When we try to get 2^32-1 block of the file which has the extent
(ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON
in ext4_ext_put_gap_in_cache().

To avoid the problem, ext4_map_blocks() needs to check the file logical block
number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
handle 2^32-1 because the maximum file logical block number is 2^32-2.

Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
So ext4_map_blocks() should also return the same errno.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
Al Viro fc7b1646bf smarter propagate_mnt()
commit f2ebb3a921 upstream.

The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
Tetsuo Handa e6713b5e47 ocfs2: fix panic on kfree(xattr->name)
commit f81c20158f upstream.

Commit 9548906b2b ('xattr: Constify ->name member of "struct xattr"')
missed that ocfs2 is calling kfree(xattr->name).  As a result, kernel
panic occurs upon calling kfree(xattr->name) because xattr->name refers
static constant names.  This patch removes kfree(xattr->name) from
ocfs2_mknod() and ocfs2_symlink().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Tested-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
alex chen 65155d0493 ocfs2: do not put bh when buffer_uptodate failed
commit f7cf4f5bfe upstream.

Do not put bh when buffer_uptodate failed in ocfs2_write_block and
ocfs2_write_super_or_backup, because it will put bh in b_end_io.
Otherwise it will hit a warning "VFS: brelse: Trying to free free
buffer".

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
Junxiao Bi 798c267521 ocfs2: dlm: fix recovery hung
commit ded2cf7141 upstream.

There is a race window in dlm_do_recovery() between dlm_remaster_locks()
and dlm_reset_recovery() when the recovery master nearly finish the
recovery process for a dead node.  After the master sends FINALIZE_RECO
message in dlm_remaster_locks(), another node may become the recovery
master for another dead node, and then send the BEGIN_RECO message to
all the nodes included the old master, in the handler of this message
dlm_begin_reco_handler() of old master, dlm->reco.dead_node and
dlm->reco.new_master will be set to the second dead node and the new
master, then in dlm_reset_recovery(), these two variables will be reset
to default value.  This will cause new recovery master can not finish
the recovery process and hung, at last the whole cluster will hung for
recovery.

old recovery master:                                 new recovery master:
dlm_remaster_locks()
                                                  become recovery master for
                                                  another dead node.
                                                  dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
 if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
  return -EAGAIN;
 }
 dlm_set_reco_master(dlm, br->node_idx);
 dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
 dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
 dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
                                                  will hang in dlm_remaster_locks() for
                                                  request dlm locks info

Before send FINALIZE_RECO message, recovery master should set
DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery done,
this can break the race windows as the BEGIN_RECO messages will not be
handled before DLM_RECO_STATE_FINALIZE flag is cleared.

A similar race may happen between new recovery master and normal node
which is in dlm_finalize_reco_handler(), also fix it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:36 -07:00
Junxiao Bi de3778076c ocfs2: dlm: fix lock migration crash
commit 34aa8dac48 upstream.

This issue was introduced by commit 800deef3f6 ("ocfs2: use
list_for_each_entry where benefical") in 2007 where it replaced
list_for_each with list_for_each_entry.  The variable "lock" will point
to invalid data if "tmpq" list is empty and a panic will be triggered
due to this.  Sunil advised reverting it back, but the old version was
also not right.  At the end of the outer for loop, that
list_for_each_entry will also set "lock" to an invalid data, then in the
next loop, if the "tmpq" list is empty, "lock" will be an stale invalid
data and cause the panic.  So reverting the list_for_each back and reset
"lock" to NULL to fix this issue.

Another concern is that this seemes can not happen because the "tmpq"
list should not be empty.  Let me describe how.

old lock resource owner(node 1):                                  migratation target(node 2):
image there's lockres with a EX lock from node 2 in
granted list, a NR lock from node x with convert_type
EX in converting list.
dlm_empty_lockres() {
 dlm_pick_migration_target() {
   pick node 2 as target as its lock is the first one
   in granted list.
 }
 dlm_migrate_lockres() {
   dlm_mark_lockres_migrating() {
     res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
	 //after the above code, we can not dirty lockres any more,
     // so dlm_thread shuffle list will not run
                                                                   downconvert lock from EX to NR
                                                                   upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.

	 // will set lockres RES_MIGRATING flag, the following
	 // lock/unlock can not run
     dlm_lockres_release_ast(dlm, res);
   }

   dlm_send_one_lockres()
                                                                 dlm_process_recovery_data()
                                                                   for (i=0; i<mres->num_locks; i++)
                                                                     if (ml->node == dlm->node_num)
                                                                       for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                                        list_for_each_entry(lock, tmpq, list)
                                                                        if (lock) break; <<< lock is invalid as grant list is empty.
                                                                       }
                                                                       if (lock->ml.node != ml->node)
                                                                         BUG() >>> crash here
 }

I see the above locks status from a vmcore of our internal bug.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:35 -07:00
Jeff Mahoney 3defef4e96 reiserfs: fix race in readdir
commit 01d8885785 upstream.

jdm-20004 reiserfs_delete_xattrs: Couldn't delete all xattrs (-2)

The -ENOENT is due to readdir calling dir_emit on the same entry twice.

If the dir_emit callback sleeps and the tree is changed underneath us,
we won't be able to trust deh_offset(deh) anymore. We need to save
next_pos before we might sleep so we can find the next entry.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:29 -07:00
Jeff Layton 43222bc58f nfsd: set timeparms.to_maxval in setup_callback_client
commit 3758cf7e14 upstream.

...otherwise the logic in the timeout handling doesn't work correctly.

Spotted-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
Kinglong Mee c54b5d2149 NFSD: Traverse unconfirmed client through hash-table
commit 2b90563598 upstream.

When stopping nfsd, I got BUG messages, and soft lockup messages,
The problem is cuased by double rb_erase() in nfs4_state_destroy_net()
and destroy_client().

This patch just let nfsd traversing unconfirmed client through
hash-table instead of rbtree.

[ 2325.021995] BUG: unable to handle kernel NULL pointer dereference at
          (null)
[ 2325.022809] IP: [<ffffffff8133c18c>] rb_erase+0x14c/0x390
[ 2325.022982] PGD 7a91b067 PUD 7a33d067 PMD 0
[ 2325.022982] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2325.022982] Modules linked in: nfsd(OF) cfg80211 rfkill bridge stp
llc snd_intel8x0 snd_ac97_codec ac97_bus auth_rpcgss nfs_acl serio_raw
e1000 i2c_piix4 ppdev snd_pcm snd_timer lockd pcspkr joydev parport_pc
snd parport i2c_core soundcore microcode sunrpc ata_generic pata_acpi
[last unloaded: nfsd]
[ 2325.022982] CPU: 1 PID: 2123 Comm: nfsd Tainted: GF          O
3.14.0-rc8+ #2
[ 2325.022982] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[ 2325.022982] task: ffff88007b384800 ti: ffff8800797f6000 task.ti:
ffff8800797f6000
[ 2325.022982] RIP: 0010:[<ffffffff8133c18c>]  [<ffffffff8133c18c>]
rb_erase+0x14c/0x390
[ 2325.022982] RSP: 0018:ffff8800797f7d98  EFLAGS: 00010246
[ 2325.022982] RAX: ffff880079c1f010 RBX: ffff880079f4c828 RCX:
0000000000000000
[ 2325.022982] RDX: 0000000000000000 RSI: ffff880079bcb070 RDI:
ffff880079f4c810
[ 2325.022982] RBP: ffff8800797f7d98 R08: 0000000000000000 R09:
ffff88007964fc70
[ 2325.022982] R10: 0000000000000000 R11: 0000000000000400 R12:
ffff880079f4c800
[ 2325.022982] R13: ffff880079bcb000 R14: ffff8800797f7da8 R15:
ffff880079f4c860
[ 2325.022982] FS:  0000000000000000(0000) GS:ffff88007f900000(0000)
knlGS:0000000000000000
[ 2325.022982] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2325.022982] CR2: 0000000000000000 CR3: 000000007a3ef000 CR4:
00000000000006e0
[ 2325.022982] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2325.022982] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 2325.022982] Stack:
[ 2325.022982]  ffff8800797f7de0 ffffffffa0191c6e ffff8800797f7da8
ffff8800797f7da8
[ 2325.022982]  ffff880079f4c810 ffff880079bcb000 ffffffff81cc26c0
ffff880079c1f010
[ 2325.022982]  ffff880079bcb070 ffff8800797f7e28 ffffffffa01977f2
ffff8800797f7df0
[ 2325.022982] Call Trace:
[ 2325.022982]  [<ffffffffa0191c6e>] destroy_client+0x32e/0x3b0 [nfsd]
[ 2325.022982]  [<ffffffffa01977f2>] nfs4_state_shutdown_net+0x1a2/0x220
[nfsd]
[ 2325.022982]  [<ffffffffa01700b8>] nfsd_shutdown_net+0x38/0x70 [nfsd]
[ 2325.022982]  [<ffffffffa017013e>] nfsd_last_thread+0x4e/0x80 [nfsd]
[ 2325.022982]  [<ffffffffa001f1eb>] svc_shutdown_net+0x2b/0x30 [sunrpc]
[ 2325.022982]  [<ffffffffa017064b>] nfsd_destroy+0x5b/0x80 [nfsd]
[ 2325.022982]  [<ffffffffa0170773>] nfsd+0x103/0x130 [nfsd]
[ 2325.022982]  [<ffffffffa0170670>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 2325.022982]  [<ffffffff810a8232>] kthread+0xd2/0xf0
[ 2325.022982]  [<ffffffff810a8160>] ? insert_kthread_work+0x40/0x40
[ 2325.022982]  [<ffffffff816c493c>] ret_from_fork+0x7c/0xb0
[ 2325.022982]  [<ffffffff810a8160>] ? insert_kthread_work+0x40/0x40
[ 2325.022982] Code: 48 83 e1 fc 48 89 10 0f 84 02 01 00 00 48 3b 41 10
0f 84 08 01 00 00 48 89 51 08 48 89 fa e9 74 ff ff ff 0f 1f 40 00 48 8b
50 10 <f6> 02 01 0f 84 93 00 00 00 48 8b 7a 10 48 85 ff 74 05 f6 07 01
[ 2325.022982] RIP  [<ffffffff8133c18c>] rb_erase+0x14c/0x390
[ 2325.022982]  RSP <ffff8800797f7d98>
[ 2325.022982] CR2: 0000000000000000
[ 2325.022982] ---[ end trace 28c27ed011655e57 ]---

[  228.064071] BUG: soft lockup - CPU#0 stuck for 22s! [nfsd:558]
[  228.064428] Modules linked in: ip6t_rpfilter ip6t_REJECT cfg80211
xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc
ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw
ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security
iptable_raw nfsd(OF) auth_rpcgss nfs_acl lockd snd_intel8x0
snd_ac97_codec ac97_bus joydev snd_pcm snd_timer e1000 sunrpc snd ppdev
parport_pc serio_raw pcspkr i2c_piix4 microcode parport soundcore
i2c_core ata_generic pata_acpi
[  228.064539] CPU: 0 PID: 558 Comm: nfsd Tainted: GF          O
3.14.0-rc8+ #2
[  228.064539] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[  228.064539] task: ffff880076adec00 ti: ffff880074616000 task.ti:
ffff880074616000
[  228.064539] RIP: 0010:[<ffffffff8133ba17>]  [<ffffffff8133ba17>]
rb_next+0x27/0x50
[  228.064539] RSP: 0018:ffff880074617de0  EFLAGS: 00000282
[  228.064539] RAX: ffff880074478010 RBX: ffff88007446f860 RCX:
0000000000000014
[  228.064539] RDX: ffff880074478010 RSI: 0000000000000000 RDI:
ffff880074478010
[  228.064539] RBP: ffff880074617de0 R08: 0000000000000000 R09:
0000000000000012
[  228.064539] R10: 0000000000000001 R11: ffffffffffffffec R12:
ffffea0001d11a00
[  228.064539] R13: ffff88007f401400 R14: ffff88007446f800 R15:
ffff880074617d50
[  228.064539] FS:  0000000000000000(0000) GS:ffff88007f800000(0000)
knlGS:0000000000000000
[  228.064539] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  228.064539] CR2: 00007fe9ac6ec000 CR3: 000000007a5d6000 CR4:
00000000000006f0
[  228.064539] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  228.064539] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  228.064539] Stack:
[  228.064539]  ffff880074617e28 ffffffffa01ab7db ffff880074617df0
ffff880074617df0
[  228.064539]  ffff880079273000 ffffffff81cc26c0 ffffffff81cc26c0
0000000000000000
[  228.064539]  0000000000000000 ffff880074617e48 ffffffffa01840b8
ffffffff81cc26c0
[  228.064539] Call Trace:
[  228.064539]  [<ffffffffa01ab7db>] nfs4_state_shutdown_net+0x18b/0x220
[nfsd]
[  228.064539]  [<ffffffffa01840b8>] nfsd_shutdown_net+0x38/0x70 [nfsd]
[  228.064539]  [<ffffffffa018413e>] nfsd_last_thread+0x4e/0x80 [nfsd]
[  228.064539]  [<ffffffffa00aa1eb>] svc_shutdown_net+0x2b/0x30 [sunrpc]
[  228.064539]  [<ffffffffa018464b>] nfsd_destroy+0x5b/0x80 [nfsd]
[  228.064539]  [<ffffffffa0184773>] nfsd+0x103/0x130 [nfsd]
[  228.064539]  [<ffffffffa0184670>] ? nfsd_destroy+0x80/0x80 [nfsd]
[  228.064539]  [<ffffffff810a8232>] kthread+0xd2/0xf0
[  228.064539]  [<ffffffff810a8160>] ? insert_kthread_work+0x40/0x40
[  228.064539]  [<ffffffff816c493c>] ret_from_fork+0x7c/0xb0
[  228.064539]  [<ffffffff810a8160>] ? insert_kthread_work+0x40/0x40
[  228.064539] Code: 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48
8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 48 8b
50 10 <48> 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b 10 48 89 c7

Fixes: ac55fdc408 (nfsd: move the confirmed and unconfirmed hlists...)
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
J. Bruce Fields 0a5bb09481 nfsd4: fix setclientid encode size
commit 480efaee08 upstream.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
Yan, Zheng 647183f379 nfsd4: fix memory leak in nfsd4_encode_fattr()
commit 18df11d0ea upstream.

fh_put() does not free the temporary file handle.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
Stanislav Kinsbursky 34ef215eed nfsd: check passed socket's net matches NFSd superblock's one
commit 3064639423 upstream.

There could be a case, when NFSd file system is mounted in network, different
to socket's one, like below:

"ip netns exec" creates new network and mount namespace, which duplicates NFSd
mount point, created in init_net context. And thus NFS server stop in nested
network context leads to RPCBIND client destruction in init_net.
Then, on NFSd start in nested network context, rpc.nfsd process creates socket
in nested net and passes it into "write_ports", which leads to RPCBIND sockets
creation in init_net context because of the same reason (NFSd monut point was
created in init_net context). An attempt to register passed socket in nested
net leads to panic, because no RPCBIND client present in nexted network
namespace.

This patch add check that passed socket's net matches NFSd superblock's one.
And returns -EINVAL error to user psace otherwise.

v2: Put socket on exit.

Reported-by: Weng Meiling <wengmeiling.weng@huawei.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:28 -07:00
J. Bruce Fields f7cffcfc0f nfsd: notify_change needs elevated write count
commit 9f67f18993 upstream.

Looks like this bug has been here since these write counts were
introduced, not sure why it was just noticed now.

Thanks also to Jan Kara for pointing out the problem.

Reported-by: Matthew Rahtz <mrahtz@rapitasystems.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
J. Bruce Fields 725e71de82 nfsd4: leave reply buffer space for failed setattr
commit 04819bf644 upstream.

This fixes an ommission from 18032ca062
"NFSD: Server implementation of MAC Labeling", which increased the size
of the setattr error reply without increasing COMPOUND_ERR_SLACK_SPACE.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
J. Bruce Fields c48069a373 nfsd4: fix test_stateid error reply encoding
commit a11fcce154 upstream.

If the entire operation fails then there's nothing to encode.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
J. Bruce Fields d37c37d2d4 nfsd4: buffer-length check for SUPPATTR_EXCLCREAT
commit de3997a7ee upstream.

This was an omission from 8c18f2052e
"nfsd41: SUPPATTR_EXCLCREAT attribute".

Cc: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
J. Bruce Fields 6cd9ecd595 nfsd4: session needs room for following op to error out
commit 4c69d5855a upstream.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
J. Bruce Fields 83b1912152 nfsd: revert v2 half of "nfsd: don't return high mode bits"
commit 082f31a216 upstream.

This reverts the part of commit 6e14b46b91
that changes NFSv2 behavior.

Mark Lord found that it broke nfs-root for Linux clients, because it
broke NFSv2.

In fact, from RFC 1094:

	"Notice that the file type is specified both in the mode bits
	and in the file type.  This is really a bug in the protocol and
	will be fixed in future versions."

So NFSv2 clients really are expected to depend on the high bits of the
mode.

Reported-by: Mark Lord <mlord@pobox.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: Johan Hovold <jhovold@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
Trond Myklebust dbeeb36eb6 NFSv4: Fix a use-after-free problem in open()
commit e911b8158e upstream.

If we interrupt the nfs4_wait_for_completion_rpc_task() call in
nfs4_run_open_task(), then we don't prevent the RPC call from
completing. So freeing up the opendata->f_attr.mdsthreshold
in the error path in _nfs4_do_open() leads to a use-after-free
when the XDR decoder tries to decode the mdsthreshold information
from the server.

Fixes: 82be417aa3 (NFSv4.1 cache mdsthreshold values on OPEN)
Tested-by: Steve Dickson <SteveD@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:27 -07:00
Li Zefan 5e7d38dd79 jffs2: remove from wait queue after schedule()
commit 3ead957844 upstream.

@wait is a local variable, so if we don't remove it from the wait queue
list, later wake_up() may end up accessing invalid memory.

This was spotted by eyes.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Li Zefan bd82a0a428 jffs2: avoid soft-lockup in jffs2_reserve_space_gc()
commit 13b546d962 upstream.

We triggered soft-lockup under stress test on 2.6.34 kernel.

BUG: soft lockup - CPU#1 stuck for 60009ms! [lockf2.test:14488]
...
[<bf09a4d4>] (jffs2_do_reserve_space+0x420/0x440 [jffs2])
[<bf09a528>] (jffs2_reserve_space_gc+0x34/0x78 [jffs2])
[<bf0a1350>] (jffs2_garbage_collect_dnode.isra.3+0x264/0x478 [jffs2])
[<bf0a2078>] (jffs2_garbage_collect_pass+0x9c0/0xe4c [jffs2])
[<bf09a670>] (jffs2_reserve_space+0x104/0x2a8 [jffs2])
[<bf09dc48>] (jffs2_write_inode_range+0x5c/0x4d4 [jffs2])
[<bf097d8c>] (jffs2_write_end+0x198/0x2c0 [jffs2])
[<c00e00a4>] (generic_file_buffered_write+0x158/0x200)
[<c00e14f4>] (__generic_file_aio_write+0x3a4/0x414)
[<c00e15c0>] (generic_file_aio_write+0x5c/0xbc)
[<c012334c>] (do_sync_write+0x98/0xd4)
[<c0123a84>] (vfs_write+0xa8/0x150)
[<c0123d74>] (sys_write+0x3c/0xc0)]

Fix this by adding a cond_resched() in the while loop.

[akpm@linux-foundation.org: don't initialize `ret']
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Ajesh Kunhipurayil Vijayan a29cb38d48 jffs2: Fix crash due to truncation of csize
commit 41bf1a24c1 upstream.

mounting JFFS2 partition sometimes crashes with this call trace:

[ 1322.240000] Kernel bug detected[#1]:
[ 1322.244000] Cpu 2
[ 1322.244000] $ 0   : 0000000000000000 0000000000000018 000000003ff00070 0000000000000001
[ 1322.252000] $ 4   : 0000000000000000 c0000000f3980150 0000000000000000 0000000000010000
[ 1322.260000] $ 8   : ffffffffc09cd5f8 0000000000000001 0000000000000088 c0000000ed300de8
[ 1322.268000] $12   : e5e19d9c5f613a45 ffffffffc046d464 0000000000000000 66227ba5ea67b74e
[ 1322.276000] $16   : c0000000f1769c00 c0000000ed1e0200 c0000000f3980150 0000000000000000
[ 1322.284000] $20   : c0000000f3a80000 00000000fffffffc c0000000ed2cfbd8 c0000000f39818f0
[ 1322.292000] $24   : 0000000000000004 0000000000000000
[ 1322.300000] $28   : c0000000ed2c0000 c0000000ed2cfab8 0000000000010000 ffffffffc039c0b0
[ 1322.308000] Hi    : 000000000000023c
[ 1322.312000] Lo    : 000000000003f802
[ 1322.316000] epc   : ffffffffc039a9f8 check_tn_node+0x88/0x3b0
[ 1322.320000]     Not tainted
[ 1322.324000] ra    : ffffffffc039c0b0 jffs2_do_read_inode_internal+0x1250/0x1e48
[ 1322.332000] Status: 5400f8e3    KX SX UX KERNEL EXL IE
[ 1322.336000] Cause : 00800034
[ 1322.340000] PrId  : 000c1004 (Netlogic XLP)
[ 1322.344000] Modules linked in:
[ 1322.348000] Process jffs2_gcd_mtd7 (pid: 264, threadinfo=c0000000ed2c0000, task=c0000000f0e68dd8, tls=0000000000000000)
[ 1322.356000] Stack : c0000000f1769e30 c0000000ed010780 c0000000ed010780 c0000000ed300000
        c0000000f1769c00 c0000000f3980150 c0000000f3a80000 00000000fffffffc
        c0000000ed2cfbd8 ffffffffc039c0b0 ffffffffc09c6340 0000000000001000
        0000000000000dec ffffffffc016c9d8 c0000000f39805a0 c0000000f3980180
        0000008600000000 0000000000000000 0000000000000000 0000000000000000
        0001000000000dec c0000000f1769d98 c0000000ed2cfb18 0000000000010000
        0000000000010000 0000000000000044 c0000000f3a80000 c0000000f1769c00
        c0000000f3d207a8 c0000000f1769d98 c0000000f1769de0 ffffffffc076f9c0
        0000000000000009 0000000000000000 0000000000000000 ffffffffc039cf90
        0000000000000017 ffffffffc013fbdc 0000000000000001 000000010003e61c
        ...
[ 1322.424000] Call Trace:
[ 1322.428000] [<ffffffffc039a9f8>] check_tn_node+0x88/0x3b0
[ 1322.432000] [<ffffffffc039c0b0>] jffs2_do_read_inode_internal+0x1250/0x1e48
[ 1322.440000] [<ffffffffc039cf90>] jffs2_do_crccheck_inode+0x70/0xd0
[ 1322.448000] [<ffffffffc03a1b80>] jffs2_garbage_collect_pass+0x160/0x870
[ 1322.452000] [<ffffffffc03a392c>] jffs2_garbage_collect_thread+0xdc/0x1f0
[ 1322.460000] [<ffffffffc01541c8>] kthread+0xb8/0xc0
[ 1322.464000] [<ffffffffc0106d18>] kernel_thread_helper+0x10/0x18
[ 1322.472000]
[ 1322.472000]
Code: 67bd0050  94a4002c  2c830001 <00038036> de050218  2403fffc  0080a82d  00431824  24630044
[ 1322.480000] ---[ end trace b052bb90e97dfbf5 ]---

The variable csize in structure jffs2_tmp_dnode_info is of type uint16_t, but it
is used to hold the compressed data length(csize) which is declared as uint32_t.
So, when the value of csize exceeds 16bits, it gets truncated when assigned to
tn->csize. This is causing a kernel BUG.
Changing the definition of csize in jffs2_tmp_dnode_info to uint32_t fixes the issue.

Signed-off-by: Ajesh Kunhipurayil Vijayan <ajesh@broadcom.com>
Signed-off-by: Kamlakant Patel <kamlakant.patel@broadcom.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Kamlakant Patel 882b3a9e70 jffs2: Fix segmentation fault found in stress test
commit 3367da5610 upstream.

Creating a large file on a JFFS2 partition sometimes crashes with this call
trace:

[  306.476000] CPU 13 Unable to handle kernel paging request at virtual address c0000000dfff8002, epc == ffffffffc03a80a8, ra == ffffffffc03a8044
[  306.488000] Oops[#1]:
[  306.488000] Cpu 13
[  306.492000] $ 0   : 0000000000000000 0000000000000000 0000000000008008 0000000000008007
[  306.500000] $ 4   : c0000000dfff8002 000000000000009f c0000000e0007cde c0000000ee95fa58
[  306.508000] $ 8   : 0000000000000001 0000000000008008 0000000000010000 ffffffffffff8002
[  306.516000] $12   : 0000000000007fa9 000000000000ff0e 000000000000ff0f 80e55930aebb92bb
[  306.524000] $16   : c0000000e0000000 c0000000ee95fa5c c0000000efc80000 ffffffffc09edd70
[  306.532000] $20   : ffffffffc2b60000 c0000000ee95fa58 0000000000000000 c0000000efc80000
[  306.540000] $24   : 0000000000000000 0000000000000004
[  306.548000] $28   : c0000000ee950000 c0000000ee95f738 0000000000000000 ffffffffc03a8044
[  306.556000] Hi    : 00000000000574a5
[  306.560000] Lo    : 6193b7a7e903d8c9
[  306.564000] epc   : ffffffffc03a80a8 jffs2_rtime_compress+0x98/0x198
[  306.568000]     Tainted: G        W
[  306.572000] ra    : ffffffffc03a8044 jffs2_rtime_compress+0x34/0x198
[  306.580000] Status: 5000f8e3    KX SX UX KERNEL EXL IE
[  306.584000] Cause : 00800008
[  306.588000] BadVA : c0000000dfff8002
[  306.592000] PrId  : 000c1100 (Netlogic XLP)
[  306.596000] Modules linked in:
[  306.596000] Process dd (pid: 170, threadinfo=c0000000ee950000, task=c0000000ee6e0858, tls=0000000000c47490)
[  306.608000] Stack : 7c547f377ddc7ee4 7ffc7f967f5d7fae 7f617f507fc37ff4 7e7d7f817f487f5f
        7d8e7fec7ee87eb3 7e977ff27eec7f9e 7d677ec67f917f67 7f3d7e457f017ed7
        7fd37f517f867eb2 7fed7fd17ca57e1d 7e5f7fe87f257f77 7fd77f0d7ede7fdb
        7fba7fef7e197f99 7fde7fe07ee37eb5 7f5c7f8c7fc67f65 7f457fb87f847e93
        7f737f3e7d137cd9 7f8e7e9c7fc47d25 7dbb7fac7fb67e52 7ff17f627da97f64
        7f6b7df77ffa7ec5 80057ef17f357fb3 7f767fa27dfc7fd5 7fe37e8e7fd07e53
        7e227fcf7efb7fa1 7f547e787fa87fcc 7fcb7fc57f5a7ffb 7fc07f6c7ea97e80
        7e2d7ed17e587ee0 7fb17f9d7feb7f31 7f607e797e887faa 7f757fdd7c607ff3
        7e877e657ef37fbd 7ec17fd67fe67ff7 7ff67f797ff87dc4 7eef7f3a7c337fa6
        7fe57fc97ed87f4b 7ebe7f097f0b8003 7fe97e2a7d997cba 7f587f987f3c7fa9
        ...
[  306.676000] Call Trace:
[  306.680000] [<ffffffffc03a80a8>] jffs2_rtime_compress+0x98/0x198
[  306.684000] [<ffffffffc0394f10>] jffs2_selected_compress+0x110/0x230
[  306.692000] [<ffffffffc039508c>] jffs2_compress+0x5c/0x388
[  306.696000] [<ffffffffc039dc58>] jffs2_write_inode_range+0xd8/0x388
[  306.704000] [<ffffffffc03971bc>] jffs2_write_end+0x16c/0x2d0
[  306.708000] [<ffffffffc01d3d90>] generic_file_buffered_write+0xf8/0x2b8
[  306.716000] [<ffffffffc01d4e7c>] __generic_file_aio_write+0x1ac/0x350
[  306.720000] [<ffffffffc01d50a0>] generic_file_aio_write+0x80/0x168
[  306.728000] [<ffffffffc021f7dc>] do_sync_write+0x94/0xf8
[  306.732000] [<ffffffffc021ff6c>] vfs_write+0xa4/0x1a0
[  306.736000] [<ffffffffc02202e8>] SyS_write+0x50/0x90
[  306.744000] [<ffffffffc0116cc0>] handle_sys+0x180/0x1a0
[  306.748000]
[  306.748000]
Code: 020b202d  0205282d  90a50000 <90840000> 14a40038  00000000  0060602d  0000282d  016c5823
[  306.760000] ---[ end trace 79dd088435be02d0 ]---
Segmentation fault

This crash is caused because the 'positions' is declared as an array of signed
short. The value of position is in the range 0..65535, and will be converted
to a negative number when the position is greater than 32767 and causes a
corruption and crash. Changing the definition to 'unsigned short' fixes this
issue

Signed-off-by: Jayachandran C <jchandra@broadcom.com>
Signed-off-by: Kamlakant Patel <kamlakant.patel@broadcom.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Dan Carpenter d4654552f7 fs: NULL dereference in posix_acl_to_xattr()
commit 47ba973440 upstream.

This patch moves the dereference of "buffer" after the check for NULL.
The only place which passes a NULL parameter is gfs2_set_acl().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Eric Whitney c80e9ae4c5 ext4: fix premature freeing of partial clusters split across leaf blocks
commit ad6599ab3a upstream.

Xfstests generic/311 and shared/298 fail when run on a bigalloc file
system.  Kernel error messages produced during the tests report that
blocks to be freed are already on the to-be-freed list.  When e2fsck
is run at the end of the tests, it typically reports bad i_blocks and
bad free blocks counts.

The bug that causes these failures is located in ext4_ext_rm_leaf().
Code at the end of the function frees a partial cluster if it's not
shared with an extent remaining in the leaf.  However, if all the
extents in the leaf have been removed, the code dereferences an
invalid extent pointer (off the front of the leaf) when the check for
sharing is made.  This generally has the effect of unconditionally
freeing the partial cluster, which leads to the observed failures
when the partial cluster is shared with the last extent in the next
leaf.

Fix this by attempting to free the cluster only if extents remain in
the leaf.  Any remaining partial cluster will be freed if possible
when the next leaf is processed or when leaf removal is complete.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Eric Whitney 5ffaffe424 ext4: fix partial cluster handling for bigalloc file systems
commit c063449394 upstream.

Commit 9cb00419fa, which enables hole punching for bigalloc file
systems, exposed a bug introduced by commit 6ae06ff51e in an earlier
release.  When run on a bigalloc file system, xfstests generic/013, 068,
075, 083, 091, 100, 112, 127, 263, 269, and 270 fail with e2fsck errors
or cause kernel error messages indicating that previously freed blocks
are being freed again.

The latter commit optimizes the selection of the starting extent in
ext4_ext_rm_leaf() when hole punching by beginning with the extent
supplied in the path argument rather than with the last extent in the
leaf node (as is still done when truncating).  However, the code in
rm_leaf that initially sets partial_cluster to track cluster sharing on
extent boundaries is only guaranteed to run if rm_leaf starts with the
last node in the leaf.  Consequently, partial_cluster is not correctly
initialized when hole punching, and a cluster on the boundary of a
punched region that should be retained may instead be deallocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:06 -07:00
Eric Whitney dc5af0f7e2 ext4: fix error return from ext4_ext_handle_uninitialized_extents()
commit ce37c42919 upstream.

Commit 3779473246 breaks the return of error codes from
ext4_ext_handle_uninitialized_extents() in ext4_ext_map_blocks().  A
portion of the patch assigns that function's signed integer return
value to an unsigned int.  Consequently, negatively valued error codes
are lost and can be treated as a bogus allocated block count.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Josef Bacik 2891d0f102 Btrfs: check for an extent_op on the locked ref
commit 573a075567 upstream.

We could have possibly added an extent_op to the locked_ref while we dropped
locked_ref->lock, so check for this case as well and loop around.  Otherwise we
could lose flag updates which would lead to extent tree corruption.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Josef Bacik 50b51ee311 Btrfs: fix deadlock with nested trans handles
commit 3bbb24b20a upstream.

Zach found this deadlock that would happen like this

btrfs_end_transaction <- reduce trans->use_count to 0
  btrfs_run_delayed_refs
    btrfs_cow_block
      find_free_extent
	btrfs_start_transaction <- increase trans->use_count to 1
          allocate chunk
	btrfs_end_transaction <- decrease trans->use_count to 0
	  btrfs_run_delayed_refs
	    lock tree block we are cowing above ^^

We need to only decrease trans->use_count if it is above 1, otherwise leave it
alone.  This will make nested trans be the only ones who decrease their added
ref, and will let us get rid of the trans->use_count++ hack if we have to commit
the transaction.  Thanks,

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Hidetoshi Seto 53dcb28868 Btrfs: skip submitting barrier for missing device
commit f88ba6a2a4 upstream.

I got an error on v3.13:
 BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)

how to reproduce:
  > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
  > wipefs -a /dev/sdf2
  > mount -o degraded /dev/sdf1 /mnt
  > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt

The reason of the error is that barrier_all_devices() failed to submit
barrier to the missing device.  However it is clear that we cannot do
anything on missing device, and also it is not necessary to care chunks
on the missing device.

This patch stops sending/waiting barrier if device is missing.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Mark Tinguely 7de24f7b0d xfs: fix directory hash ordering bug
commit c88547a811 upstream.

Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.

For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.

This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown.  xfs_repair will fix the directory hash
ordering without data loss occuring.

[dchinner: wrote useful a commit message]

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Jan Kara 1c23ab6f88 bdi: avoid oops on device removal
commit 5acda9d12d upstream.

After commit 839a8e8660 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.

Fixes: 839a8e8660

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Derek Basehore 67001f3a0b backing_dev: fix hung task on sync
commit 6ca738d60c upstream.

bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes.  The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a work_queue.  This fixes the problem by using
queue_delayed_work() instead.  This is a regression caused by commit
839a8e8660 ("writeback: replace custom worker pool implementation with
unbound workqueue").

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task.  Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future.  This function doesn't change the
timer if it is already armed.

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty.  The same problem can happen if the sync work is run on the
rescue worker.

[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zento.linux.org.uk>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benson Leung <bleung@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Luigi Semenzato <semenzato@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:05 -07:00
Tejun Heo 29b311867d kernfs: protect lazy kernfs_iattrs allocation with mutex
commit 4afddd60a7 upstream.

kernfs_iattrs is allocated lazily when operations which require it
take place; unfortunately, the lazy allocation and returning weren't
properly synchronized and when there are multiple concurrent
operations, it might end up returning kernfs_iattrs which hasn't
finished initialization yet or different copies to different callers.

Fix it by synchronizing with a mutex.  This can be smarter with memory
barriers but let's go there if it actually turns out to be necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:04 -07:00
Richard Cochran 52d6c48c9d kernfs: fix off by one error.
commit 88391d49ab upstream.

The hash values 0 and 1 are reserved for magic directory entries, but
the code only prevents names hashing to 0. This patch fixes the test
to also prevent hash value 1.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26 17:19:03 -07:00
Linus Torvalds fedc1ed0f1 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Switch mnt_hash to hlist, turning the races between __lookup_mnt() and
  hash modifications into false negatives from __lookup_mnt() (instead
  of hangs)"

On the false negatives from __lookup_mnt():
 "The *only* thing we care about is not getting stuck in __lookup_mnt().
  If it misses an entry because something in front of it just got moved
  around, etc, we are fine.  We'll notice that mount_lock mismatch and
  that'll be it"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch mnt_hash to hlist
  don't bother with propagate_mnt() unless the target is shared
  keep shadowed vfsmounts together
  resizable namespace.c hashes
2014-03-30 17:26:08 -07:00
Theodore Ts'o 00a1a053eb ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30 17:02:06 -07:00
Al Viro 38129a13e6 switch mnt_hash to hlist
fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating.  Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.

[fix for dumb braino folded]

Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:51 -04:00
Al Viro 0b1b901b5a don't bother with propagate_mnt() unless the target is shared
If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:50 -04:00
Al Viro 1d6a32acd7 keep shadowed vfsmounts together
preparation to switching mnt_hash to hlist

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:50 -04:00
Al Viro 0818bf27c0 resizable namespace.c hashes
* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:49 -04:00
Sasha Levin d9060742fb ocfs2: check if cluster name exists before deref
Commit c74a3bdd9b ("ocfs2: add clustername to cluster connection") is
trying to strlcpy a string which was explicitly passed as NULL in the
very same patch, triggering a NULL ptr deref.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: strlcpy (lib/string.c:388 lib/string.c:151)
  CPU: 19 PID: 19426 Comm: trinity-c19 Tainted: G        W     3.14.0-rc7-next-20140325-sasha-00014-g9476368-dirty #274
  RIP:  strlcpy (lib/string.c:388 lib/string.c:151)
  Call Trace:
   ocfs2_cluster_connect (fs/ocfs2/stackglue.c:350)
   ocfs2_cluster_connect_agnostic (fs/ocfs2/stackglue.c:396)
   user_dlm_register (fs/ocfs2/dlmfs/userdlm.c:679)
   dlmfs_mkdir (fs/ocfs2/dlmfs/dlmfs.c:503)
   vfs_mkdir (fs/namei.c:3467)
   SyS_mkdirat (fs/namei.c:3488 fs/namei.c:3472)
   tracesys (arch/x86/kernel/entry_64.S:749)

akpm: this patch probably disables the feature.  A temporary thing to
avoid triviel oopses.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-28 13:56:58 -07:00
Jan Kara 75c5a52da3 vfs: Allocate anon_inode_inode in anon_inode_init()
Currently we allocated anon_inode_inode in anon_inodefs_mount. This is
somewhat fragile as if that function ever gets called again, it will
overwrite anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
[ Further simplified on suggestion from Dave Jones ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-27 09:52:54 -07:00
Linus Torvalds fce7fc79c8 fs: remove now stale label in anon_inode_init()
The previous commit removed the register_filesystem() call and the
associated error handling, but left the label for the error path that no
longer exists.  Remove that too.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-25 17:43:34 -07:00
Jan Kara d6f2589ad5 fs: Avoid userspace mounting anon_inodefs filesystem
anon_inodefs filesystem is a kernel internal filesystem userspace
shouldn't mess with. Remove registration of it so userspace cannot
even try to mount it (which would fail anyway because the filesystem is
MS_NOUSER).

This fixes an oops triggered by trinity when it tried mounting
anon_inodefs which overwrote anon_inode_inode pointer while other CPU
has been in anon_inode_getfile() between ihold() and d_instantiate().
Thus effectively creating dentry pointing to an inode without holding a
reference to it.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-25 17:42:16 -07:00
Linus Torvalds 632b06aa28 Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Pull nfsd fix frm Bruce Fields:
 "J R Okajima sent this early and I was just slow to pass it along,
  apologies.  Fortunately it's a simple fix"

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix lost nfserrno() call in nfsd_setattr()
2014-03-25 15:24:11 -07:00