Fix RedHat bug 329431
The idea here is separate "conscious" from "unconscious" flushes.
Conscious flushes are those due to a fsync() or close(). Unconscious
ones are flushes that occur as a side effect of some other operation or
due to memory pressure.
Currently, when an error occurs during an unconscious flush (ENOSPC or
EIO), we toss out the page and don't preserve that error to report to
the user when a conscious flush occurs. If after the unconscious flush,
there are no more dirty pages for the inode, the conscious flush will
simply return success even though there were previous errors when writing
out pages. This can lead to data corruption.
The easiest way to reproduce this is to mount up a CIFS share that's
very close to being full or where the user is very close to quota. mv
a file to the share that's slightly larger than the quota allows. The
writes will all succeed (since they go to pagecache). The mv will do a
setattr to set the new file's attributes. This calls
filemap_write_and_wait,
which will return an error since all of the pages can't be written out.
Then later, when the flush and release ops occur, there are no more
dirty pages in pagecache for the file and those operations return 0. mv
then assumes that the file was written out correctly and deletes the
original.
CIFS already has a write_behind_rc variable where it stores the results
from earlier flushes, but that value is only reported in cifs_close.
Since the VFS ignores the return value from the release operation, this
isn't helpful. We should be reporting this error during the flush
operation.
This patch does the following:
1) changes cifs_fsync to use filemap_write_and_wait and cifs_flush and also
sync to check its return code. If it returns successful, they then check
the value of write_behind_rc to see if an earlier flush had reported any
errors. If so, they return that error and clear write_behind_rc.
2) sets write_behind_rc in a few other places where pages are written
out as a side effect of other operations and the code waits on them.
3) changes cifs_setattr to only call filemap_write_and_wait for
ATTR_SIZE changes.
4) makes cifs_writepages accurately distinguish between EIO and ENOSPC
errors when writing out pages.
Some simple testing indicates that the patch works as expected and that
it fixes the reproduceable known problem.
Acked-by: Dave Kleikamp <shaggy@austin.rr.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When retrying kernel_recvmsg() because of a short read, check returned
length against the remaining length, not against total length. This
avoids unneeded session reconnects which would otherwise occur when
kernel_recvmsg() finally returns zero when asked to read zero bytes.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Have CIFS_SessSetup call cifs_get_spnego_key when Kerberos is
negotiated. Use the info in the key payload to build a session
setup request packet. Also clean up how the request buffer in
the function is freed on error.
With appropriate user space helper (in samba/source/client). Kerberos
support (secure session establishment can be done now via Kerberos,
previously users would have to use NTLMv2 instead for more secure
session setup).
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...and populate it with the hostname portion of the UNC string.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Move all of the kfree's sprinkled in the middle of the function to the
end, and have the code set rc and just goto there on error. Also zero
out the password string before freeing it. Looks like this should also
fix a potential memory leak of the prepath string if an error occurs
near the end of the function.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
With 64KB blocksize, a directory entry can have size 64KB which does not
fit into 16 bits we have for entry lenght. So we store 0xffff instead and
convert value when read from / written to disk. The patch also converts
some places to use ext3_next_entry() when we are changing them anyway.
[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix some warnings with SMBFS_DEBUG_* builds. This patch makes it so that
builds with -Werror don't fail.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_open / sys_read were used in the early 1.2 days to load firmware from
disk inside drivers. Since 2.0 or so this was deprecated behavior, but
several drivers still were using this. Since a few years we have a
request_firmware() API that implements this in a nice, consistent way.
Only some old ISA sound drivers (pre-ALSA) still straggled along for some
time.... however with commit c2b1239a9f the
last user is now gone.
This is a good thing, since using sys_open / sys_read etc for firmware is a
very buggy to dangerous thing to do; these operations put an fd in the
process file descriptor table.... which then can be tampered with from
other threads for example. For those who don't want the firmware loader,
filp_open()/vfs_read are the better APIs to use, without this security
issue.
The patch below marks sys_open and sys_read unused now that they're
really not used anymore, and for deletion in the 2.6.25 timeframe.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we special case when we have only the initial pid namespace.
Unfortunately in doing so the copied case for the other namespaces was
broken so we don't properly flush the thread directories :(
So this patch removes the unnecessary special case (removing a usage of
proc_mnt) and corrects the flushing of the thread directories.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is not a new problem in 2.6.23-git17. 2.6.22/2.6.23 is buggy in the
same way.
Reiserfs could accumulate dirty sub-page-size files until umount time.
They cannot be synced to disk by pdflush routines or explicit `sync'
commands. Only `umount' can do the trick.
The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
Call trace:
[<ffffffff8027e920>] cancel_dirty_page+0xd0/0xf0
[<ffffffff8816d470>] :reiserfs:reiserfs_cut_from_item+0x660/0x710
[<ffffffff8816d791>] :reiserfs:reiserfs_do_truncate+0x271/0x530
[<ffffffff8815872d>] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
[<ffffffff8815d3d0>] :reiserfs:reiserfs_file_release+0x1e0/0x340
[<ffffffff802a187c>] __fput+0xcc/0x1b0
[<ffffffff802a1ba6>] fput+0x16/0x20
[<ffffffff8029e676>] filp_close+0x56/0x90
[<ffffffff8029fe0d>] sys_close+0xad/0x110
[<ffffffff8020c41e>] system_call+0x7e/0x83
Fix the bug by removing the cancel_dirty_page() call. Tests show that
it causes no bad behaviors on various write sizes.
=== for the patient ===
Here are more detailed demonstrations of the problem.
1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.
------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# cat > /test/tiny
[T1] hi
[T2] root /home/wfg#
------------------------------ screen 1 ------------------------------
[T1] root /home/wfg# echo /test/tiny > /proc/filecache
[T1] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___UD__Bd_ 2
[T2] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___U___Bd_ 2
2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.
------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# echo hi > /tmp/hi
[T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
[T2] hi
[T3] root /home/wfg#
------------------------------ screen 1 ------------------------------
[T1] root /proc/4397# cd /proc/`pidof cp`
[T1] root /proc/4713# cat io
rchar: 8396
wchar: 3
syscr: 20
syscw: 1
read_bytes: 0
write_bytes: 20480
cancelled_write_bytes: 4096
[T2] root /proc/4713# cat io
rchar: 8399
wchar: 6
syscr: 21
syscw: 2
read_bytes: 0
write_bytes: 24576
cancelled_write_bytes: 4096
//Question: the 'write_bytes' is a bit more than expected ;-)
Tested-by: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Reviewed-by: Chris Mason <chris.mason@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I found a few bugs in the BFS driver. Detailed description of the bugs as
well as the steps to reproduce the errors are given in the kernel bugzilla.
Please follow these links for more information:
http://bugzilla.kernel.org/show_bug.cgi?id=9363http://bugzilla.kernel.org/show_bug.cgi?id=9364http://bugzilla.kernel.org/show_bug.cgi?id=9365http://bugzilla.kernel.org/show_bug.cgi?id=9366
This patch fixes the bugs described above. Besides, the patch introduces
coding style changes to make the BFS driver conform to the requirements
specified for Linux kernel code. Finally, I made a few cosmetic changes
such as removal of trivial debug output.
Also, the patch removes the fields `si_lf_ioff' and `si_lf_sblk' of the
in-core superblock structure. These fields are initialized but never
actually used.
If you are wondering why I need BFS, here is the answer: I am using this
driver in the context of Linux kernel classes I am teaching in the Moscow
State University and in the International Institute of Information
Technology in Pune, India.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@gmail.com>
Cc: Tigran Aivazian <tigran@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
allow bulk updating of the sbinfo->free_blocks counter. This will be used by
the next patch in the series.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
mappings when that support was added. Currently, quota is debited at page
instantiation and credited at file truncation. This approach works correctly
for shared pages but is incomplete for private pages. In addition to
hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
this function does not respect quotas.
Private huge pages are treated very much like normal, anonymous pages. They
are not "backed" by the hugetlbfs file and are not stored in the mapping's
radix tree. This means that private pages are invisible to
truncate_hugepages() so that function will not credit the quota.
This patch (based on a prototype provided by Ken Chen) moves quota crediting
for all pages into free_huge_page(). page->private is used to store a pointer
to the mapping to which this page belongs. This is used to credit quota on
the appropriate hugetlbfs instance.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It appears we overlooked support for removing generic proc files
when we added support for multiple proc super blocks. Handle
that now.
[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Forbid user from changing file flags on quota files. User has no bussiness
in playing with these flags when quota is on. Furthermore there is a
remote possibility of deadlock due to a lock inversion between quota file's
i_mutex and transaction's start (i_mutex for quota file is locked only when
trasaction is started in quota operations) in ext3 and ext4.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: LIOU Payphone <lioupayphone@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: <reiserfs-dev@namesys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page->index should be cast to loff_t instead of off_t.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
request
In SendReceive() function in transport.c - it memcpy's
message payload into a buffer passed via out_buf param. The function
assumes that all buffers are of size (CIFSMaxBufSize +
MAX_CIFS_HDR_SIZE) , unfortunately it is also called with smaller
(MAX_CIFS_SMALL_BUFFER_SIZE) buffers. There are eight callers
(SMB worker functions) which are primarily affected by this change:
TreeDisconnect, uLogoff, Close, findClose, SetFileSize, SetFileTimes,
Lock and PosixLock
CC: Dave Kleikamp <shaggy@austin.ibm.com>
CC: Przemyslaw Wegrzyn <czajnik@czajsoft.pl>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This reverts commit 7c9e69faa2, fixing up
conflicts in fs/ext4/balloc.c manually.
The cost of doing the bitmap validation on each lookup - even when the
bitmap is cached - is absolutely prohibitive. We could, and probably
should, do it only when adding the bitmap to the buffer cache. However,
right now we are better off just reverting it.
Peter Zijlstra measured the cost of this extra validation as a 85%
decrease in cached iozone, and while I had a patch that took it down to
just 17% by not being _quite_ so stupid in the validation, it was still
a big slowdown that could have been avoided by just doing it right.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reverts Eric's commit 2b008b0a8e
It diets .text & .data section of the kernel if CONFIG_NET_NS is not set.
This is safe after list operations cleanup.
Signed-of-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
...and fix a couple of bugs in the NBD, CIFS and OCFS2 socket handlers.
Looking at the sock->op->shutdown() handlers, it looks as if all of them
take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
RCV_SHUTDOWN/SEND_SHUTDOWN arguments.
Add a helper, and then define the SHUT_* enum to ensure that kernel users
of shutdown() don't get confused.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Mark Fasheh <mark.fasheh@oracle.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As with commit 7fc90ec93a ("knfsd: nfsd:
call nfsd_setuser() on fh_compose(), fix nfsd4 permissions problem")
this is a case where we need to redo a security check in fh_verify()
even though the filehandle already has an associated dentry--if the
filehandle was created by fh_compose() in an earlier operation of the
nfsv4 compound, then we may not have done these checks yet.
Without this fix it is possible, for example, to traverse from an export
without the secure ports requirement to one with it in a single
compound, and bypass the secure port check on the new export.
While we're here, fix up some minor style problems and change a printk()
to a dprintk(), to make it harder for random unprivileged users to spam
the logs.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-By: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The v2/v3 acl code in nfsd is translating any return from fh_verify() to
nfserr_inval. This is particularly unfortunate in the case of an
nfserr_dropit return, which is an internal error meant to indicate to
callers that this request has been deferred and should just be dropped
pending the results of an upcall to mountd.
Thanks to Roland <devzero@web.de> for bug report and data collection.
Cc: Roland <devzero@web.de>
Acked-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-By: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (21 commits)
[CIFS] fix oops on second mount to same server when null auth is used
[CIFS] Fix stale mode after readdir when cifsacl specified
[CIFS] add mode to acl conversion helper function
[CIFS] Fix incorrect mode when ACL had deny access control entries
[CIFS] Add uid to key description so krb can handle user mounts
[CIFS] Fix walking out end of cifs dacl
[CIFS] Add upcall files for cifs to use spnego/kerberos
[CIFS] add OIDs for KRB5 and MSKRB5 to ASN1 parsing routines
[CIFS] Register and unregister cifs_spnego_key_type on module init/exit
[CIFS] implement upcalls for SPNEGO blob via keyctl API
[CIFS] allow cifs_calc_signature2 to deal with a zero length iovec
[CIFS] If no Access Control Entries, set mode perm bits to zero
[CIFS] when mount helper missing fix slash wrong direction in share
[CIFS] Don't request too much permission when reading an ACL
[CIFS] enable get mode from ACL when cifsacl mount option specified
[CIFS] ACL support part 8
[CIFS] acl support part 7
[CIFS] acl support part 6
[CIFS] acl support part 6
[CIFS] remove unused funtion compile warning when experimental off
...
The coredump code always calls set_dumpable(0) when it starts (even
if RLIMIT_CORE prevents any core from being dumped). The effect of
this (via task_dumpable) is to make /proc/pid/* files owned by root
instead of the user, so the user can no longer examine his own
process--in a case where there was never any privileged data to
protect. This affects e.g. auxv, environ, fd; in Fedora (execshield)
kernels, also maps. In practice, you can only notice this when a
debugger has requested PTRACE_EVENT_EXIT tracing.
set_dumpable was only used in do_coredump for synchronization and not
intended for any security purpose. (It doesn't secure anything that wasn't
already unsecured when a process dies by SIGTERM instead of SIGQUIT.)
This changes do_coredump to check the core_waiters count as the means of
synchronization, which is sufficient. Now we leave the "dumpable" bits alone.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a share is mounted using no username, cifs_mount sets
volume_info.username as a NULL pointer, and the sesInfo userName as an
empty string. The volume_info.username is passed to a couple of other
functions to see if there is an existing unc or tcp connection that can
be used. These functions assume that the username will be a valid
string that can be passed to strncmp. If the pointer is NULL, then the
kernel will oops if there's an existing session to which the string
can be compared.
This patch changes cifs_mount to set volume_info.username to an empty
string in this situation, which prevents the oops and should make it
so that the comparison to other null auth sessions match.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When mounted with cifsacl mount option, readdir can not
instantiate the inode with the estimated mode based on the ACL
for each file since we have not queried for the ACL for
each of these files yet. So set the refresh time to zero
for these inodes so that the next stat will cause the client
to go to the server for the ACL info so we can build the estimated
mode (this means we also will issue an extra QueryPathInfo if
the stat happens within 1 second, but this is trivial compared to
the time required to open/getacl/close for each).
ls -l is slower when cifsacl mount option is specified, but
displays correct mode information.
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When mounted with the cifsacl mount option, we were
treating any deny ACEs found like allow ACEs and it turns out for
SFU and SUA Windows set these type of access control entries often.
The order of ACEs is important too. The canonical order that most
ACL tools and Windows explorer consruct ACLs with is to begin with
DENY entries then follow with ALLOW, otherwise an allow entry
could be encountered first, making the subsequent deny entry like "dead
code which would be superflous since Windows stops when a match is
made for the operation you are trying to perform for your user
We start with no permissions in the mode and build up as we find
permissions (ie allow ACEs). This fixes deny ACEs so they affect
the mask used to set the subsequent allow ACEs.
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Alexander Bokovoy <ab@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Adds uid to key description fro supporting user mounts
and minor formating changes
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Normally io priorities follow the CPU nice, unless a specific scheduling
class has been set. Once that is set, there's no way to reset the
behaviour to 'none' so that it follows CPU nice again.
Currently passing in 0 as the ioprio class/value will return -1/EINVAL,
change that to allow resetting of a set scheduling class.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If another node unlinks the destination while ocfs2_rename() is waiting on a
cluster lock, ocfs2_rename() simply logs an error and continues. This causes
a crash because the renaming node is now trying to delete a non-existent
inode. The correct solution is to return -ENOENT.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We should subtract start of our IO from PAGE_CACHE_SIZE to get the right
length of the write we want to perform.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
On file systems which don't support sparse files, Ocfs2_map_page_blocks()
was reading blocks on appending writes. This caused write performance to
suffer dramatically. Fix this by detecting an appending write on a nonsparse
fs and skipping the read.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We're missing a meta data commit for extending sync writes. In thoery, write
could return with the meta data required to read the data uncommitted to
disk. Fix that by detecting an allocating write and forcing a journal commit
in the sync case.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Do this to avoid a theoretical (I haven't seen this in practice) race where
the downconvert thread might drop the dentry lock, allowing a remote unlink
to proceed before dropping the inode locks. This could bounce access to the
orphan dir between nodes.
There doesn't seem to be a need to do the same in ocfs2_dentry_iput() as
that's never called for the last ref drop from the downconvert thread.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
If we have not yet created a cluster lock, ocfs2_cluster_lock() will
first create it at NLMODE, and then convert the lock to either PRMODE or
EXMODE (whichever is requested).
Change ocfs2_cluster_lock() to just create the lock at the initially
requested level. ocfs2_locking_ast() handles this case fine, so the only
update required was in setup of locking state. This should reduce the number
of network messages required for a new lock by one, providing an incremental
performance enhancement.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>