Commit Graph

1318 Commits

Author SHA1 Message Date
Joel Becker 1fc8a11786 ocfs2: Don't walk off the end of fast symlinks.
ocfs2 fast symlinks are NUL terminated strings stored inline in the
inode data area.  However, disk corruption or a local attacker could, in
theory, remove that NUL.  Because we're using strlen() (my fault,
introduced in a731d1 when removing vfs_follow_link()), we could walk off
the end of that string.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-09-29 17:33:05 -07:00
Srinivas Eeda 5dad6c39d1 o2dlm: force free mles during dlm exit
While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.

This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-23 14:16:53 -07:00
Tao Ma 0000b86202 ocfs2: Sync inode flags with ext2.
We sync our inode flags with ext2 and define them by hex
values. But actually in commit 3669567(4 years ago), all
these values are moved to include/linux/fs.h. So we'd
better also use them as what ext2 did. So sync our inode
flags with ext2 by using FS_*.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-23 14:16:49 -07:00
Tao Ma 4a452de4fd ocfs2: Move 'wanted' into parens of ocfs2_resmap_resv_bits.
The first time I read the function ocfs2_resmap_resv_bits, I consider
about what 'wanted' will be used and consider about the comments.
Then I find it is only used if the reservation is empty. ;)

So we'd better move it to the parens so that it make the code more
readable, what's more, ocfs2_resmap_resv_bits is used so frequently
and we should save some cpus.

Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-23 14:16:47 -07:00
Tao Ma 47dea42379 ocfs2: Use cpu_to_le16 for e_leaf_clusters in ocfs2_bg_discontig_add_extent.
e_leaf_clusters is a le16, so use cpu_to_le16 instead
of cpu_to_le32.

What's more, we change 'clusters' to unsigned int to
signify that the size of 'clusters' isn't important here.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-23 14:16:34 -07:00
Tao Ma 12828061cd ocfs2: update ctime when changing the file's permission by setfacl
In commit 30e2bab, ext3 fixed it. So change it accordingly in ocfs2.

Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1283760364
# setfacl -m  'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1283760364

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-23 14:16:21 -07:00
Wu Fengguang 50aff04036 ocfs2/net: fix uninitialized ret in o2net_send_message_vec()
mmotm/fs/ocfs2/cluster/tcp.c: In function ‘o2net_send_message_vec’:
mmotm/fs/ocfs2/cluster/tcp.c:980:6: warning: ‘ret’ may be used uninitialized in this function

It seems a real bug introduced by commit 9af0b38ff3 (ocfs2/net:
Use wait_event() in o2net_send_message_vec()).

cc: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-18 08:48:54 -07:00
Tristan Ye 228ac63577 Ocfs2: Handle empty list in lockres_seq_start() for dlmdebug.c
This patch tries to handle the case in which list 'dlm->tracking_list' is
empty, to avoid accessing an invalid pointer. It fixes the following oops:

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1287

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 09:19:30 -07:00
Tristan Ye 0f4da216b8 Ocfs2: Re-access the journal after ocfs2_insert_extent() in dxdir codes.
In ocfs2_dx_dir_rebalance(), we need to rejournal_acess the blocks after
calling ocfs2_insert_extent() since growing an extent tree may trigger
ocfs2_extend_trans(), which makes previous journal_access meaningless.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 09:19:11 -07:00
Tao Ma 07eaac9438 ocfs2: Fix lockdep warning in reflink.
This patch change mutex_lock to a new subclass and
add a new inode lock subclass for the target inode
which caused this lockdep warning.

=============================================
[ INFO: possible recursive locking detected ]
2.6.35+ #5
---------------------------------------------
reflink/11086 is trying to acquire lock:
 (Meta){+++++.}, at: [<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]

but task is already holding lock:
 (Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]

other info that might help us debug this:
6 locks held by reflink/11086:
 #0:  (&sb->s_type->i_mutex_key#15/1){+.+.+.}, at: [<ffffffff820e09ec>] lookup_create+0x26/0x97
 #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa06f99a0>] ocfs2_reflink_ioctl+0x4d3/0x1229 [ocfs2]
 #2:  (Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]
 #3:  (&oi->ip_xattr_sem){+.+.+.}, at: [<ffffffffa06f9b58>] ocfs2_reflink_ioctl+0x68b/0x1229 [ocfs2]
 #4:  (&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa06f9b67>] ocfs2_reflink_ioctl+0x69a/0x1229 [ocfs2]
 #5:  (&sb->s_type->i_mutex_key#15/2){+.+...}, at: [<ffffffffa06f9d4f>] ocfs2_reflink_ioctl+0x882/0x1229 [ocfs2]

stack backtrace:
Pid: 11086, comm: reflink Not tainted 2.6.35+ #5
Call Trace:
 [<ffffffff82063dd9>] validate_chain+0x56e/0xd68
 [<ffffffff82062275>] ? mark_held_locks+0x49/0x69
 [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
 [<ffffffff82065a81>] lock_acquire+0xc6/0xed
 [<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
 [<ffffffffa06c9ade>] __ocfs2_cluster_lock+0x975/0xa0d [ocfs2]
 [<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
 [<ffffffffa06e107b>] ? ocfs2_wait_for_recovery+0x15/0x8a [ocfs2]
 [<ffffffffa06cb6ea>] ocfs2_inode_lock_full_nested+0x1ac/0xdc5 [ocfs2]
 [<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
 [<ffffffff820623a0>] ? trace_hardirqs_on_caller+0x10b/0x12f
 [<ffffffff82060193>] ? debug_mutex_free_waiter+0x4f/0x53
 [<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
 [<ffffffffa06ce24a>] ? ocfs2_file_lock_res_init+0x66/0x78 [ocfs2]
 [<ffffffff820bb2d2>] ? might_fault+0x40/0x8d
 [<ffffffffa06df9f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2]
 [<ffffffff820ee5d3>] ? mntput_no_expire+0x1d/0xb0
 [<ffffffff820e07b3>] ? path_put+0x2c/0x31
 [<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d
 [<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae
 [<ffffffff8233a7f6>] ? _raw_spin_unlock+0x26/0x2a
 [<ffffffff8200299c>] ? sysret_check+0x27/0x62
 [<ffffffff820e59ab>] sys_ioctl+0x57/0x7a
 [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 09:19:06 -07:00
Tao Ma 5e64b0d9e8 ocfs2/lockdep: Move ip_xattr_sem out of ocfs2_xattr_get_nolock.
As the name shows, we shouldn't have any lock in
ocfs2_xattr_get_nolock. so lift ip_xattr_sem to the caller.
This should be safe for us since the only 2 callers are:
1. ocfs2_xattr_get which will lock the resources.
2. ocfs2_mknod which don't need this locking.

And this also resolves the following lockdep warning.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.35+ #5
-------------------------------------------------------
reflink/30027 is trying to acquire lock:
 (&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2]

but task is already holding lock:
 (&oi->ip_xattr_sem){++++..}, at: [<ffffffffa0673b58>] ocfs2_reflink_ioctl+0x68b/0x1226 [ocfs2]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&oi->ip_xattr_sem){++++..}:
       [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
       [<ffffffff82065a81>] lock_acquire+0xc6/0xed
       [<ffffffff82339650>] down_read+0x34/0x47
       [<ffffffffa0691cb8>] ocfs2_xattr_get_nolock+0xa0/0x4e6 [ocfs2]
       [<ffffffffa069d64f>] ocfs2_get_acl_nolock+0x5c/0x132 [ocfs2]
       [<ffffffffa069d9c7>] ocfs2_init_acl+0x60/0x243 [ocfs2]
       [<ffffffffa066499d>] ocfs2_mknod+0xae8/0xfea [ocfs2]
       [<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2]
       [<ffffffff820e1c83>] vfs_create+0x9b/0xf4
       [<ffffffff820e20bb>] do_last+0x2fd/0x5be
       [<ffffffff820e31c0>] do_filp_open+0x1fb/0x572
       [<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7
       [<ffffffff820d6dac>] sys_open+0x1b/0x1d
       [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b

-> #2 (jbd2_handle){+.+...}:
       [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
       [<ffffffff82065a81>] lock_acquire+0xc6/0xed
       [<ffffffffa0604ff8>] start_this_handle+0x4a3/0x4bc [jbd2]
       [<ffffffffa06051d6>] jbd2__journal_start+0xba/0xee [jbd2]
       [<ffffffffa0605218>] jbd2_journal_start+0xe/0x10 [jbd2]
       [<ffffffffa065ca34>] ocfs2_start_trans+0xb7/0x19b [ocfs2]
       [<ffffffffa06645f3>] ocfs2_mknod+0x73e/0xfea [ocfs2]
       [<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2]
       [<ffffffff820e1c83>] vfs_create+0x9b/0xf4
       [<ffffffff820e20bb>] do_last+0x2fd/0x5be
       [<ffffffff820e31c0>] do_filp_open+0x1fb/0x572
       [<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7
       [<ffffffff820d6dac>] sys_open+0x1b/0x1d
       [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b

-> #1 (&journal->j_trans_barrier){.+.+..}:
       [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
       [<ffffffff82064fa9>] lock_release_non_nested+0x1e5/0x24b
       [<ffffffff82065999>] lock_release+0x158/0x17a
       [<ffffffff823389f6>] __mutex_unlock_slowpath+0xbf/0x11b
       [<ffffffff82338a5b>] mutex_unlock+0x9/0xb
       [<ffffffffa0679673>] ocfs2_free_ac_resource+0x31/0x67 [ocfs2]
       [<ffffffffa067c6bc>] ocfs2_free_alloc_context+0x11/0x1d [ocfs2]
       [<ffffffffa0633de0>] ocfs2_write_begin_nolock+0x141e/0x159b [ocfs2]
       [<ffffffffa0635523>] ocfs2_write_begin+0x11e/0x1e7 [ocfs2]
       [<ffffffff820a1297>] generic_file_buffered_write+0x10c/0x210
       [<ffffffffa0653624>] ocfs2_file_aio_write+0x4cc/0x6d3 [ocfs2]
       [<ffffffff820d822d>] do_sync_write+0xc2/0x106
       [<ffffffff820d897b>] vfs_write+0xae/0x131
       [<ffffffff820d8e55>] sys_write+0x47/0x6f
       [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b

-> #0 (&oi->ip_alloc_sem){+.+.+.}:
       [<ffffffff82063f92>] validate_chain+0x727/0xd68
       [<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
       [<ffffffff82065a81>] lock_acquire+0xc6/0xed
       [<ffffffff82339694>] down_write+0x31/0x52
       [<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2]
       [<ffffffffa06599f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2]
       [<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d
       [<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae
       [<ffffffff820e59ab>] sys_ioctl+0x57/0x7a
       [<ffffffff8200296b>] system_call_fastpath+0x16/0x1b

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 09:19:05 -07:00
Mark Fasheh 97b8f4a9df ocfs2: Fix orphan add in ocfs2_create_inode_in_orphan
ocfs2_create_inode_in_orphan() is used by reflink to create the newly
reflinked inode simultaneously in the orphan dir. This allows us to easily
handle partially-reflinked files during recovery cleanup.

We have a problem though - the orphan dir stringifies inode # to determine
a unique name under which the orphan entry dirent can be created. Since
ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir
before it can allocate the inode, we currently call into the orphan code:

       /*
        * We give the orphan dir the root blkno to fake an orphan name,
        * and allocate enough space for our insertion.
        */
       status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
                                         osb->root_blkno,
                                         orphan_name, &orphan_insert);

Using osb->root_blkno might work fine on unindexed directories, but the
orphan dir can have an index.  When it has that index, the above code fails
to allocate the proper index entry.  Later, when we try to remove the file
from the orphan dir (using the actual inode #), the reflink operation will
fail.

To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the
newly split out orphan and inode alloc code to figure out what the inode
block number will be (once allocated) and then prepare the orphan dir from
that data.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:26:00 +08:00
Mark Fasheh dd43bcde23 ocfs2: split out ocfs2_prepare_orphan_dir() into locking and prep functions
We do this because ocfs2_create_inode_in_orphan() wants to order locking of
the orphan dir with respect to locking of the inode allocator *before*
making any changes to the directory.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:26:00 +08:00
Mark Fasheh e49e27674d ocfs2: allow return of new inode block location before allocation of the inode
This allows code which needs to know the eventual block number of an inode
but can't allocate it yet due to transaction or lock ordering. For example,
ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
of the orphan dir because it can't yet know where the actual inode is placed
- that code is actually in ocfs2_mknod_locked. This is a problem when the
orphan dirs are indexed as the junk inode number will create an index entry
which goes unused (and fails the later removal from the orphan dir).  Now
with these interfaces, ocfs2_create_inode_in_orphan() can run the block
group search (and get back the inode block number) *before* any actual
allocation occurs.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:59 +08:00
Mark Fasheh d51349829c ocfs2: use ocfs2_alloc_dinode_update_counts() instead of open coding
ocfs2_search_chain() makes the same updates as
ocfs2_alloc_dinode_update_counts to the alloc inode. Instead of open coding
the bitmap update, use our helper function.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:58 +08:00
Mark Fasheh 021960cab3 ocfs2: split out inode alloc code from ocfs2_mknod_locked
Do this by splitting the bulk of the function away from the inode allocation
code at the very tom of ocfs2_mknod_locked(). Existing callers don't need to
change and won't see any difference. The new function created,
__ocfs2_mknod_locked() will be used shortly.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:58 +08:00
Tristan Ye 81c8c82b5a Ocfs2: Fix a regression bug from mainline commit(6b933c8e6f).
The patch is to fix the regression bug brought from commit 6b933c8...( 'ocfs2:
Avoid direct write if we fall back to buffered I/O'):

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285

The commit 6b933c8e6f changed __generic_file_aio_write
to generic_file_buffered_write, which didn't call filemap_{write,wait}_range to  flush
the pagecaches when we were falling O_DIRECT writes back to buffered ones. it did hurt
the O_DIRECT semantics somehow in extented odirect writes.

This patch tries to guarantee O_DIRECT writes of 'fall back to buffered' to be correctly
flushed.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:57 +08:00
Jan Kara 9b4c0ff32c ocfs2: Fix deadlock when allocating page
We cannot call grab_cache_page() when holding filesystem locks or with
a transaction started as grab_cache_page() calls page allocation with
GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem
causing deadlocks or various assertion failures. We have to use
find_or_create_page() instead and pass it GFP_NOFS as we do with other
allocations.

Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:57 +08:00
Mark Fasheh b2b6ebf5f7 ocfs2: properly set and use inode group alloc hint
We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from
res->sr_bg_blkno.  Unfortunately, res->sr_bg_blkno is going to be zero under
normal (non-fragmented) circumstances. The discontig block group patches
effectively turned off that feature. Fix this by correctly calculating what
the next group hint should be.

Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:56 +08:00
Tao Ma 889f004a8c ocfs2: Use the right group in nfs sync check.
We have added discontig block group now, and now an inode
can be allocated in an discontig block group. So get
it in ocfs2_get_suballoc_slot_bit.

The old ocfs2_test_suballoc_bit gets group block no
from the allocation inode which is wrong. Fix it by
passing the right group.

Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:56 +08:00
Jan Kara 04eda1a180 ocfs2: Flush drive's caches on fdatasync
When 'barrier' mount option is specified, we have to issue a cache flush
during fdatasync(2). We have to do this even if inode doesn't have
I_DIRTY_DATASYNC set because we still have to get written *data* to disk so
that they are not lost in case of crash.

Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Singed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:55 +08:00
Tao Ma f63afdb2c3 ocfs2: make __ocfs2_page_mkwrite handle file end properly.
__ocfs2_page_mkwrite now is broken in handling file end.
1. the last page should be the page contains i_size - 1.
2. the len in the last page is also calculated wrong.
So change them accordingly.

Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:55 +08:00
Sunil Mushran f5ce5a08a4 ocfs2: Fix incorrect checksum validation error
For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync() to
read the inode off the disk. The latter first checks to see if that block is
cached in the journal, and, if so, returns that block. That is ok.

But ocfs2_read_locked_inode() goes wrong when it tries to validate the checksum
of such blocks. Blocks that are cached in the journal may not have had their
checksum computed as yet. We should not validate the checksums of such blocks.

Fixes ossbz#1282
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Singed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:54 +08:00
Sunil Mushran dc696aced9 ocfs2: Fix metaecc error messages
Like tools, the checksum validate function now prints the values in hex.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Singed-off-by: Tao Ma <tao.ma@oracle.com>
2010-09-08 14:25:53 +08:00
Linus Torvalds a30bfd6cd4 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  O2net: Disallow o2net accept connection request from itself.
  ocfs2/dlm: remove potential deadlock -V3
  ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
  Fix the nested PR lock calling issue in ACL
  ocfs2: Count more refcount records in file system fragmentation.
  ocfs2 fix o2dlm dlm run purgelist (rev 3)
  ocfs2/dlm: fix a dead lock
  ocfs2: do not overwrite error codes in ocfs2_init_acl
2010-08-13 10:43:50 -07:00
Linus Torvalds 5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Al Viro b57922d97f convert remaining ->clear_inode() to ->evict_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:37 -04:00
Al Viro 45321ac543 Make ->drop_inode() just return whether inode needs to be dropped
... and let iput_final() do the actual eviction or retention

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:35 -04:00
Al Viro 066d92dcbf convert ocfs2 to ->evict_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:21 -04:00
Christoph Hellwig 2c27c65ed0 check ATTR_SIZE contraints in inode_change_ok
Make sure we check the truncate constraints early on in ->setattr by adding
those checks to inode_change_ok.  Also clean up and document inode_change_ok
to make this obvious.

As a fallout we don't have to call inode_newsize_ok from simple_setsize and
simplify it down to a truncate_setsize which doesn't return an error.  This
simplifies a lot of setattr implementations and means we use truncate_setsize
almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
ext2_setsize static to make the calling convention obvious.

Keep the inode_newsize_ok in vmtruncate for now as all callers need an
audit for its removal anyway.

Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
needs a deeper audit, but that is left for later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:39 -04:00
Christoph Hellwig 1025774ce4 remove inode_setattr
Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:37 -04:00
Christoph Hellwig eafdc7d190 sort out blockdev_direct_IO variants
Move the call to vmtruncate to get rid of accessive blocks to the callers
in prepearation of the new truncate calling sequence.  This was only done
for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
its _newtrunc variant while at it as just opencoding the two additional
paramters is shorted than the name suffix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:29 -04:00
Linus Torvalds 09dc942c2a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Adding error check after calling ext4_mb_regular_allocator()
  ext4: Fix dirtying of journalled buffers in data=journal mode
  ext4: re-inline ext4_rec_len_(to|from)_disk functions
  jbd2: Remove t_handle_lock from start_this_handle()
  jbd2: Change j_state_lock to be a rwlock_t
  jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
  ext4: Add mount options in superblock
  ext4: force block allocation on quota_off
  ext4: fix freeze deadlock under IO
  ext4: drop inode from orphan list if ext4_delete_inode() fails
  ext4: check to make make sure bd_dev is set before dereferencing it
  jbd2: Make barrier messages less scary
  ext4: don't print scary messages for allocation failures post-abort
  ext4: fix EFBIG edge case when writing to large non-extent file
  ext4: fix ext4_get_blocks references
  ext4: Always journal quota file modifications
  ext4: Fix potential memory leak in ext4_fill_super
  ext4: Don't error out the fs if the user tries to make a file too big
  ext4: allocate stripe-multiple IOs on stripe boundaries
  ext4: move aio completion after unwritten extent conversion
  ...

Fix up conflicts in fs/ext4/inode.c as per Ted.

Fix up xfs conflicts as per earlier xfs merge.
2010-08-07 13:03:53 -07:00
Tristan Ye 415cf32c9c O2net: Disallow o2net accept connection request from itself.
Currently, o2net_accept_one() is allowed to accept a connection from
listening node itself, such a fake connection will not be successfully
established due to no handshake detected afterwards, and later end up
with triggering connecting worker in a loop.

We're going to fix this by treating such connection request as 'invalid',
since we've got no chance of requesting connection from a node to itself
in a OCFS2 cluster.

The fix doesn't hurt user's scan for o2net-listener, it always gets a
successful connection from userpace.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:50:33 -07:00
Wengang Wang b11f1f1ab7 ocfs2/dlm: remove potential deadlock -V3
When we need to take both dlm_domain_lock and dlm->spinlock, we should take
them in order of: dlm_domain_lock then dlm->spinlock.

There is pathes disobey this order. That is calling dlm_lockres_put() with
dlm->spinlock held in dlm_run_purge_list. dlm_lockres_put() calls dlm_put() at
the ref and dlm_put() locks on dlm_domain_lock.

Fix:
Don't grab/put the dlm when the initialising/releasing lockres.
That grab is not required because we don't call dlm_unregister_domain()
based on refcount.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:50:30 -07:00
Wengang Wang a524812b7e ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
In the following situation, there remains an incorrect bit in refmap on the
recovery master. Finally the recovery master will fail at purging the lockres
due to the incorrect bit in refmap.

1) node A has no interest on lockres A any longer, so it is purging it.
2) the owner of lockres A is node B, so node A is sending de-ref message
to node B.
3) at this time, node B crashed. node C becomes the recovery master. it recovers
lockres A(because the master is the dead node B).
4) node A migrated lockres A to node C with a refbit there.
5) node A failed to send de-ref message to node B because it crashed. The failure
is ignored. no other action is done for lockres A any more.

For mormal, re-send the deref message to it to recovery master can fix it. Well,
ignoring the failure of deref to the original master and not recovering the lockres
to recovery master has the same effect. And the later is simpler.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:49:41 -07:00
Jiaju Zhang 845b6cf341 Fix the nested PR lock calling issue in ACL
Hi,

Thanks a lot for all the review and comments so far;) I'd like to send
the improved (V4) version of this patch.

This patch fixes a deadlock in OCFS2 ACL. We found this bug in OCFS2
and Samba integration using scenario, the symptom is several smbd
processes will be hung under heavy workload. Finally we found out it
is the nested PR lock calling that leads to this deadlock:

 node1        node2
              gr PR
                |
                V
 PR(EX)---> BAST:OCFS2_LOCK_BLOCKED
                |
                V
              rq PR
                |
                V
              wait=1

After requesting the 2nd PR lock, the process "smbd" went into D
state. It can only be woken up when the 1st PR lock's RO holder equals
zero. There should be an ocfs2_inode_unlock in the calling path later
on, which can decrement the RO holder. But since it has been in
uninterruptible sleep, the unlock function has no chance to be called.

The related stack trace is:
smbd          D ffff8800013d0600     0  9522   5608 0x00000000
 ffff88002ca7fb18 0000000000000282 ffff88002f964500 ffff88002ca7fa98
 ffff8800013d0600 ffff88002ca7fae0 ffff88002f964340 ffff88002f964340
 ffff88002ca7ffd8 ffff88002ca7ffd8 ffff88002f964340 ffff88002f964340
Call Trace:
[<ffffffff80350425>] schedule_timeout+0x175/0x210
[<ffffffff8034f580>] wait_for_common+0xf0/0x210
[<ffffffffa03e12b9>] __ocfs2_cluster_lock+0x3b9/0xa90 [ocfs2]
[<ffffffffa03e7665>] ocfs2_inode_lock_full_nested+0x255/0xdb0 [ocfs2]
[<ffffffffa0446019>] ocfs2_get_acl+0x69/0x120 [ocfs2]
[<ffffffffa0446368>] ocfs2_check_acl+0x28/0x80 [ocfs2]
[<ffffffff800e3507>] acl_permission_check+0x57/0xb0
[<ffffffff800e357d>] generic_permission+0x1d/0xc0
[<ffffffffa03eecea>] ocfs2_permission+0x10a/0x1d0 [ocfs2]
[<ffffffff800e3f65>] inode_permission+0x45/0x100
[<ffffffff800d86b3>] sys_chdir+0x53/0x90
[<ffffffff80007458>] system_call_fastpath+0x16/0x1b
[<00007f34a4ef6927>] 0x7f34a4ef6927

For details, please see:
https://bugzilla.novell.com/show_bug.cgi?id=614332 and
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1278

Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:46:46 -07:00
Tao Ma 8a2e70c40f ocfs2: Count more refcount records in file system fragmentation.
The refcount record calculation in ocfs2_calc_refcount_meta_credits
is too optimistic that we can always allocate contiguous clusters
and handle an already existed refcount rec as a whole. Actually
because of file system fragmentation, we may have the chance to split
a refcount record into 3 parts during the transaction. So consider
the worst case in record calculation.

Cc: stable@kernel.org
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:44:49 -07:00
Srinivas Eeda 7beaf24378 ocfs2 fix o2dlm dlm run purgelist (rev 3)
This patch fixes two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.

This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
 next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:44:40 -07:00
Wengang Wang 6d98c3ccb5 ocfs2/dlm: fix a dead lock
When we have to take both dlm->master_lock and lockres->spinlock,
take them in order

lockres->spinlock and then dlm->master_lock.

The patch fixes a violation of the rule.
We can simply move taking dlm->master_lock to where we have dropped res->spinlock
since when we access res->state and free mle memory we don't need master_lock's
protection.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:44:20 -07:00
Tiger Yang 6eda3dd33f ocfs2: do not overwrite error codes in ocfs2_init_acl
Setting the acl while creating a new inode depends on
the error codes of posix_acl_create_masq. This patch fix
a issue of overwriting the error codes of it.

Reported-by: Pawel Zawora <pzawora@gmail.com>
Cc: <stable@kernel.org> [ .33, .34 ]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-08-07 10:43:25 -07:00
Jiri Kosina d790d4d583 Merge branch 'master' into for-next 2010-08-04 15:14:38 +02:00
Theodore Ts'o a931da6ac9 jbd2: Change j_state_lock to be a rwlock_t
Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores.  So
change it to be a read/write spinlock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-03 21:35:12 -04:00
Christoph Hellwig 552ef8024f direct-io: move aio_complete into ->end_io
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited.  That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.

This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated. 

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Christoph Hellwig 40e2e97316 direct-io: move aio_complete into ->end_io
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited.  That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.

This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 16:09:02 -05:00
Joe Perches 33fa1d909c fs/ocfs2: Remove unnecessary casts of private_data
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-07-20 17:20:08 +02:00
Linus Torvalds bea9a6d239 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Silence gcc warning in ocfs2_write_zero_page().
  jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
  ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
  ocfs2: Don't duplicate pages past i_size during CoW.
  ocfs2: tighten up strlen() checking
  ocfs2: Make xattr reflink work with new local alloc reservation.
  ocfs2: make xattr extension work with new local alloc reservation.
  ocfs2: Remove the redundant cpu_to_le64.
  ocfs2/dlm: don't access beyond bitmap size
  ocfs2: No need to zero pages past i_size.
  ocfs2: Zero the tail cluster when extending past i_size.
  ocfs2: When zero extending, do it by page.
  ocfs2: Limit default local alloc size within bitmap range.
  ocfs2: Move orphan scan work to ocfs2_wq.
  fs/ocfs2/dlm: Add missing spin_unlock
2010-07-18 10:09:25 -07:00
Joel Becker 5453258d53 ocfs2: Silence gcc warning in ocfs2_write_zero_page().
ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc
doesn't know that.  Set ret=0 just to make gcc happy.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-16 13:33:39 -07:00
Jan Kara 13ceef099e jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
OCFS2 uses t_commit trigger to compute and store checksum of the just
committed blocks. When a buffer has b_frozen_data, checksum is computed
for it instead of b_data but this can result in an old checksum being
written to the filesystem in the following scenario:

1) transaction1 is opened
2) handle1 is opened
3) journal_access(handle1, bh)
    - This sets jh->b_transaction to transaction1
4) modify(bh)
5) journal_dirty(handle1, bh)
6) handle1 is closed
7) start committing transaction1, opening transaction2
8) handle2 is opened
9) journal_access(handle2, bh)
    - This copies off b_frozen_data to make it safe for transaction1 to commit.
      jh->b_next_transaction is set to transaction2.
10) jbd2_journal_write_metadata() checksums b_frozen_data
11) the journal correctly writes b_frozen_data to the disk journal
12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never queued for
      any more journal operation
13) Checkpointing finally happens, and it just spools the bh via normal buffer
writeback.  This will write b_data, which was never triggered on and thus
contains a wrong (old) checksum.

This patch fixes the problem by calling the trigger at the moment data is
frozen for journal commit - i.e., either when b_frozen_data is created by
do_get_write_access or just before we write a buffer to the log if
b_frozen_data does not exist. We also rename the trigger to t_frozen as
that better describes when it is called.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15 15:17:47 -07:00
Wengang Wang a39953dd95 ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
dlm_migration_can_proceed() for that purpose.  However, if the node is
down, dlm_migration_can_proceed() will also return "go ahead".  In this
rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
the BUG_ON() that trips over this condition.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15 10:56:30 -07:00