Commit Graph

614 Commits

Author SHA1 Message Date
Linus Torvalds f6fdd7d9c2 Don't allow normal users to set idle IO priority
It has all the normal priority inversion problems.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-20 18:51:29 -07:00
Alexey Dobriyan a5ea169c95 [PATCH] freevxfs: fix breakage introduced by symlink fixes
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-20 14:30:50 -07:00
Linus Torvalds db873896d1 befs: fix up missed follow_link declaration change
We'd updated the prototype and the return value, but not the function
declaration itself.
2005-08-20 13:20:01 -07:00
Steve Dickson 01c314a0c0 [PATCH] NFSv4: unbalanced BKL in nfs_atomic_lookup()
Added missing unlock_kernel() to NFSv4 atomic lookup.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19 18:44:56 -07:00
Al Viro 008b150a3c [PATCH] Fix up symlink function pointers
This fixes up the symlink functions for the calling convention change:

 * afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs,
   smbfs, sysvfs, ufs, xfs - prototype change for ->follow_link()
 * befs, smbfs, xfs - same for ->put_link()

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19 18:08:21 -07:00
Linus Torvalds cc314eef01 Fix nasty ncpfs symlink handling bug.
This bug could cause oopses and page state corruption, because ncpfs
used the generic page-cache symlink handlign functions.  But those
functions only work if the page cache is guaranteed to be "stable", ie a
page that was installed when the symlink walk was started has to still
be installed in the page cache at the end of the walk.

We could have fixed ncpfs to not use the generic helper routines, but it
is in many ways much cleaner to instead improve on the symlink walking
helper routines so that they don't require that absolute stability.

We do this by allowing "follow_link()" to return a error-pointer as a
cookie, which is fed back to the cleanup "put_link()" routine.  This
also simplifies NFS symlink handling.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19 18:02:56 -07:00
Al Viro 2fb1e3086d [PATCH] jffs2: fix symlink error handling
The current calling conventions for ->follow_link() are already fairly
complex.

What we have is
	1) you can return -error; then you must release nameidata yourself
	   and ->put_link() will _not_ be called.
	2) you can do nd_set_link(nd, ERR_PTR(-error)) and return 0
	3) you can do nd_set_link(nd, path) and return 0
	4) you can return 0 (after having moved nameidata yourself)

jffs2 follow_link() is broken - it has an exit where it returns
-EIO and leaks nameidata.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19 17:57:19 -07:00
Jan Kara d86c390ffb [PATCH] reiserfs+acl+quota deadlock fix
When i_acl_default is set to some error we do not hold the lock (hence we
are not allowed to drop it and reacquire later).

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <mason@suse.com>
Cc: <reiserfs-dev@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18 12:53:57 -07:00
Chuck Lever dc59250c6e [PATCH] NFS: Introduce the use of inode->i_lock to protect fields in nfsi
Down the road we want to eliminate the use of the global kernel lock entirely
from the NFS client.  To do this, we need to protect the fields in the
nfs_inode structure adequately.  Start by serializing updates to the
"cache_validity" field.

Note this change addresses an SMP hang found by njw@osdl.org, where processes
deadlock because nfs_end_data_update and nfs_revalidate_mapping update the
"cache_validity" field without proper serialization.

Test plan:
 Millions of fsx ops on SMP clients.  Run Nick Wilson's breaknfs program on
 large SMP clients.

Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18 12:53:57 -07:00
Chuck Lever 412d582ec1 [PATCH] NFS: use atomic bitops to manipulate flags in nfsi->flags
Introduce atomic bitops to manipulate the bits in the nfs_inode structure's
"flags" field.

Using bitops means we can use a generic wait_on_bit call instead of an ad hoc
locking scheme in fs/nfs/inode.c, so we can remove the "nfs_i_wait" field from
nfs_inode at the same time.

The other new flags field will continue to use bitmask and logic AND and OR.
This permits several flags to be set at the same time efficiently.  The
following patch adds a spin lock to protect these flags, and this spin lock
will later cover other fields in the nfs_inode structure, amortizing the cost
of using this type of serialization.

Test plan:
 Millions of fsx ops on SMP clients.

Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18 12:53:56 -07:00
Chuck Lever 5529680981 [PATCH] NFS: split nfsi->flags into two fields
Certain bits in nfsi->flags can be manipulated with atomic bitops, and some
are better manipulated via logical bitmask operations.

This patch splits the flags field into two.  The next patch introduces atomic
bitops for one of the fields.

Test plan:
 Millions of fsx ops on SMP clients.

Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18 12:53:56 -07:00
Linus Torvalds 099d44e869 Merge master.kernel.org:/pub/scm/linux/kernel/git/aia21/ntfs-2.6 2005-08-17 14:56:22 -07:00
Steven Rostedt c4f92dba97 [PATCH] nfsd to unlock kernel before exiting
The nfsd holds the big kernel lock upon exit, when it really shouldn't.
Not to mention that this breaks Ingo's RT patch. This is a trivial fix
to release the lock.

Ingo, this patch also works with your kernel, and stops the problem with
nfsd.

Note, there's a "goto out;" where "out:" is right above svc_exit_thread.
The point of the goto also holds the kernel_lock, so I don't see any
problem here in releasing it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-17 12:53:05 -07:00
Linus Torvalds 5153f7e6db Merge head 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/shaggy/jfs-2.6 2005-08-16 12:12:30 -07:00
Anton Altaparmakov 481d037421 NTFS: Complete the previous fix for the unset device when mapping buffers
for  mft record writing.  I had missed the writepage based mft record
      write code path.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
2005-08-16 19:42:56 +01:00
Linus Torvalds cf59001235 Merge master.kernel.org:/pub/scm/linux/kernel/git/aia21/ntfs-2.6 2005-08-16 09:31:28 -07:00
Trond Myklebust 65e4308d25 [PATCH] NFS: Ensure we always update inode->i_mode when doing O_EXCL creates
When the client performs an exclusive create and opens the file for writing,
a Netapp filer will first create the file using the mode 01777. It does this
since an NFSv3/v4 exclusive create cannot immediately set the mode bits.
The 01777 mode then gets put into the inode->i_mode. After the file creation
is successful, we then do a setattr to change the mode to the correct value
(as per the NFS spec).

The problem is that nfs_refresh_inode() no longer updates inode->i_mode, so
the latter retains the 01777 mode. A bit later, the VFS notices this, and calls
remove_suid(). This of course now resets the file mode to inode->i_mode & 0777.
Hey presto, the file mode on the server is now magically changed to 0777. Duh...

Fixes http://bugzilla.linux-nfs.org/show_bug.cgi?id=32

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-16 09:30:58 -07:00
Trond Myklebust 58fcb8df0b [PATCH] NFS: Ensure ACL xdr code doesn't overflow.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-16 08:52:11 -07:00
Anton Altaparmakov e74589ac25 NTFS: Fix bug in mft record writing where we forgot to set the device in
the buffers when mapping them after the VM had discarded them.
      Thanks to Martin MOKREJŠ for the bug report.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
2005-08-16 16:38:28 +01:00
John McCutchan 89204c40a0 [PATCH] inotify: add MOVE_SELF event
This adds a MOVE_SELF event to inotify.  It is sent whenever the inode
you are watching is moved.  We need this event so that we can catch
something like this:

 - app1:
	watch /etc/mtab

 - app2:
	cp /etc/mtab /tmp/mtab-work
	mv /etc/mtab /etc/mtab~
	mv /tmp/mtab-work /etc/mtab

app1 still thinks it's watching /etc/mtab but it's actually watching
/etc/mtab~.

Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-15 09:50:31 -07:00
Robert Love 0bf955ce98 [PATCH] inotify: fix idr_get_new_above usage
We are saving the wrong thing in ->last_wd.  We want the wd, not the
return value.

Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-15 09:48:31 -07:00
Steve French 27876d02b3 [PATCH] CIFS: Fix path name conversion for long filenames
Fix path name conversion for long filenames when mapchars mount option
was specified at mount time.

Signed-off-by: Steve French (sfrench@us.ibm.com)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-14 15:27:24 -07:00
Steve French d024709deb [PATCH] CIFS: Fix missing entries in search results
Fix missing entries in search results when very long file names and more
than 50 (or so) of such long search entries in the directory.

FindNext could send corrupt last byte of resume name when resume key was
a few hundred bytes long file name or longer.

Fixes Samba Bug # 2932

Signed-off-by: Steve French (sfrench@us.ibm.com)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-14 15:27:24 -07:00
Jan Kara 1b0a74d1c0 [PATCH] Fix error handling in reiserfs
Initialize key object ID in inode so that we don't try to remove the inode
when we fail on some checks even before we manage to allocate something.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-13 21:54:13 -07:00
Dave Kleikamp 2d610b80e9 Merge with /home/shaggy/git/linus-clean/
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
2005-08-10 11:15:13 -05:00
Dave Kleikamp 8a9cd6d676 JFS: Fix race in txLock
TxAnchor.anon_list is protected by jfsTxnLock (TXN_LOCK), but there was
a place in txLock() that was removing an entry from the list without holding
the spinlock.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
2005-08-10 11:14:39 -05:00
John McCutchan 7a91bf7f5c [PATCH] fsnotify_name/inoderemove
The patch below unhooks fsnotify from vfs_unlink & vfs_rmdir.  It
introduces two new fsnotify calls, that are hooked in at the dcache
level.  This not only more closely matches how the VFS layer works, it
also avoids the problem with locking and inode lifetimes.

The two functions are

 - fsnotify_nameremove -- called when a directory entry is going away.
   It notifies the PARENT of the deletion.  This is called from
   d_delete().

 - inoderemove -- called when the files inode itself is going away.  It
   notifies the inode that is being deleted.  This is called from
   dentry_iput().

Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-08 11:53:47 -07:00
Miklos Szeredi 68b47139ea [PATCH] namespace.c: fix bind mount from foreign namespace
I'm resending this patch, because I still believe it's the correct fix.

Tested before/after applying the patch with a test application
available from:

  http://www.inf.bme.hu/~mszeredi/nstest.c

Bind mount from a foreign namespace results in an un-removable mount.
The reason is that mnt->mnt_namespace is copied from the old mount in
clone_mnt().  Because of this check_mnt() in sys_umount() will fail.

The solution is to set mnt->mnt_namespace to current->namespace in
clone_mnt().  clone_mnt() is either called from do_loopback() or
copy_tree().  copy_tree() is called from do_loopback() or
copy_namespace().

When called (directly or indirectly) from do_loopback(), always
current->namspace is being modified: check_mnt(nd->mnt).  So setting
mnt->mnt_namespace to current->namspace is the right thing to do.

When called from copy_namespace(), the setting of mnt_namespace is
irrelevant, since mnt_namespace is reset later in that function for
all copied mounts.

Jamie said:

  This patch is correct.  The old code was buggy for more fundamental and
  serious reason: it broke the invariant that a tree of vfsmnts all have the
  same value of mnt_namespace (and the same for the mnt_list list).

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Jamie Lokier <jamie@shareable.org>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-07 10:00:38 -07:00
Andrew Morton e525e153c7 [PATCH] __bio_clone() dead comment
Remove a very wrong comment.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-07 10:00:38 -07:00
Linus Torvalds fab5a60a29 Check input buffer size in zisofs
This uses the new deflateBound() thing to sanity-check the input to the
zlib decompressor before we even bother to start reading in the blocks.

Problem noted by Tim Yamin <plasmaroo@gentoo.org>
2005-08-06 09:42:06 -07:00
John McCutchan 0c3dba1534 [PATCH] Clean up inotify delete race fix
This avoids the whole #ifdef mess by just getting a copy of
dentry->d_inode before d_delete is called - that makes the codepaths the
same for the INOTIFY/DNOTIFY cases as for the regular no-notify case.
I've been running this under a Gnome session for the last 10 minutes.
Inotify is being used extensively.

Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 21:37:39 -07:00
Dave Kleikamp a5c96cab8f Merge with /home/shaggy/git/linus-clean/
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
2005-08-04 15:56:15 -05:00
John McCutchan e234f35c54 [PATCH] inotify delete race fix
The included patch fixes a problem where a inotify client would receive a
delete event before the file was actually deleted.  The bug affects both
dnotify & inotify.

Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 13:11:15 -07:00
Robert Love 3de11748c1 [PATCH] inotify: update help text
The inotify help text still refers to the character device.  Update it.

Fixes kernel bug #4993.

Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 13:11:15 -07:00
Roman Zippel 74f9c9c258 [PATCH] hfs: don't reference missing page
If there was a read error, the bnode might miss some pages, so skip them.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 21:38:00 -07:00
Roman Zippel f76d28d235 [PATCH] hfs: don't dirty unchanged inode
If inode size hasn't changed, don't do anything further in truncate, which
also prevents a dirty inode, what might upset some readonly devices quite
badly.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 21:38:00 -07:00
Dave Kleikamp 30db1ae864 JFS: Check for invalid inodes in jfs_delete_inode
Some error paths may iput an invalid inode with i_nlink=0.  jfs should
not try to actually delete such an inode.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
2005-08-01 16:54:26 -05:00
John McCutchan b9c55d29e9 [PATCH] inotify: fix race between the kernel and user space
When you rm a watch, an IN_IGNORED event is sent down the event queue
with the watch descriptor that you just rm'd.

If you then add a watch you could get the ignored watch's wd and if you
haven't read the entire event queue, user space will think that it's
newly created watch was just ignored.

To avoid this problem we just use idr_get_new_above instead of
idr_get_new.

Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 09:16:53 -07:00
John McCutchan 7544953685 [PATCH] inotify: fix file deletion by rename detection
When a file is moved over an existing file that you are watching,
inotify won't send you a DELETE_SELF event and it won't unref the inode
until the inotify instance is closed by the application.

Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 09:16:53 -07:00
Maneesh Soni 9ca1eb3282 [PATCH] sysfs: fix sysfs_setattr
o sysfs_dirent's s_mode field should also be updated in sysfs_setattr(), else
  there could be inconsistency in the two fields. s_mode is used while
  ->readdir so as not to bring in the inode to cache.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-29 13:12:49 -07:00
Maneesh Soni bc062b1b5c [PATCH] sysfs: fix sysfs_chmod_file
o sysfs_chmod_file() must update the new iattr field in sysfs_dirent else
  the mode change will not be persistent in case of inode evacuation from
  cache.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-29 13:12:49 -07:00
Paolo 'Blaisorblade' Giarrusso a2d76bd8fa [PATCH] uml: implement hostfs syncing
Actually implement the hostfs "sync" method.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28 21:46:05 -07:00
Andrew Morton a5453be48e [PATCH] bio_clone fix
Fix bug introduced in 2.6.11-rc2: when we clone a BIO we need to copy over the
current index into it as well.

It corrupts data with some MD setups.

See http://bugzilla.kernel.org/show_bug.cgi?id=4946

Huuuuuuuuge thanks to Matthew Stapleton <matthew4196@gmail.com> for doggedly
chasing this one down.

Acked-by: Jens Axboe <axboe@suse.de>
Cc: <linux-raid@vger.kernel.org>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28 08:38:59 -07:00
Dave Kleikamp da28c12089 Merge with /home/shaggy/git/linus-clean/
/home/shaggy/git/linus-clean/
/home/shaggy/git/linus-clean/

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
2005-07-28 09:03:36 -05:00
Linus Torvalds 49302d0c42 Merge head 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/shaggy/jfs-2.6 2005-07-27 16:42:22 -07:00
Jesper Juhl 77933d7276 [PATCH] clean up inline static vs static inline
`gcc -W' likes to complain if the static keyword is not at the beginning of
the declaration.  This patch fixes all remaining occurrences of "inline
static" up with "static inline" in the entire kernel tree (140 occurrences in
47 files).

While making this change I came across a few lines with trailing whitespace
that I also fixed up, I have also added or removed a blank line or two here
and there, but there are no functional changes in the patch.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:20 -07:00
Olaf Hering 44456d37b5 [PATCH] turn many #if $undefined_string into #ifdef $undefined_string
turn many #if $undefined_string into #ifdef $undefined_string to fix some
warnings after -Wno-def was added to global CFLAGS

Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:08 -07:00
Andreas Gruenbacher 02b775696f [PATCH] reiserfs doesn't use mbcache
reiserfs doesn't use the mbcache, so this can go.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:07 -07:00
Andreas Gruenbacher 8c52ab42c1 [PATCH] mbcache: Remove unused mb_cache_shrink parameter
The cache parameter to mb_cache_shrink isn't used.  We may as well remove
it.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:07 -07:00
Peter Staubach c293621bbf [PATCH] stale POSIX lock handling
I believe that there is a problem with the handling of POSIX locks, which
the attached patch should address.

The problem appears to be a race between fcntl(2) and close(2).  A
multithreaded application could close a file descriptor at the same time as
it is trying to acquire a lock using the same file descriptor.  I would
suggest that that multithreaded application is not providing the proper
synchronization for itself, but the OS should still behave correctly.

SUS3 (Single UNIX Specification Version 3, read: POSIX) indicates that when
a file descriptor is closed, that all POSIX locks on the file, owned by the
process which closed the file descriptor, should be released.

The trick here is when those locks are released.  The current code releases
all locks which exist when close is processing, but any locks in progress
are handled when the last reference to the open file is released.

There are three cases to consider.

One is the simple case, a multithreaded (mt) process has a file open and
races to close it and acquire a lock on it.  In this case, the close will
release one reference to the open file and when the fcntl is done, it will
release the other reference.  For this situation, no locks should exist on
the file when both the close and fcntl operations are done.  The current
system will handle this case because the last reference to the open file is
being released.

The second case is when the mt process has dup(2)'d the file descriptor.
The close will release one reference to the file and the fcntl, when done,
will release another, but there will still be at least one more reference
to the open file.  One could argue that the existence of a lock on the file
after the close has completed is okay, because it was acquired after the
close operation and there is still a way for the application to release the
lock on the file, using an existing file descriptor.

The third case is when the mt process has forked, after opening the file
and either before or after becoming an mt process.  In this case, each
process would hold a reference to the open file.  For each process, this
degenerates to first case above.  However, the lock continues to exist
until both processes have released their references to the open file.  This
lock could block other lock requests.

The changes to release the lock when the last reference to the open file
aren't quite right because they would allow the lock to exist as long as
there was a reference to the open file.  This is too long.

The new proposed solution is to add support in the fcntl code path to
detect a race with close and then to release the lock which was just
acquired when such as race is detected.  This causes locks to be released
in a timely fashion and for the system to conform to the POSIX semantic
specification.

This was tested by instrumenting a kernel to detect the handling locks and
then running a program which generates case #3 above.  A dangling lock
could be reliably generated.  When the changes to detect the close/fcntl
race were added, a dangling lock could no longer be generated.

Cc: Matthew Wilcox <willy@debian.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:06 -07:00