Commit Graph

9376 Commits

Author SHA1 Message Date
Linus Torvalds a3da5bf84a Merge branch 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (821 commits)
  x86: make 64bit hpet_set_mapping to use ioremap too, v2
  x86: get x86_phys_bits early
  x86: max_low_pfn_mapped fix #4
  x86: change _node_to_cpumask_ptr to return const ptr
  x86: I/O APIC: remove an IRQ2-mask hack
  x86: fix numaq_tsc_disable calling
  x86, e820: remove end_user_pfn
  x86: max_low_pfn_mapped fix, #3
  x86: max_low_pfn_mapped fix, #2
  x86: max_low_pfn_mapped fix, #1
  x86_64: fix delayed signals
  x86: remove conflicting nx6325 and nx6125 quirks
  x86: Recover timer_ack lost in the merge of the NMI watchdog
  x86: I/O APIC: Never configure IRQ2
  x86: L-APIC: Always fully configure IRQ0
  x86: L-APIC: Set IRQ0 as edge-triggered
  x86: merge dwarf2 headers
  x86: use AS_CFI instead of UNWIND_INFO
  x86: use ignore macro instead of hash comment
  x86: use matching CFI_ENDPROC
  ...
2008-07-14 13:43:24 -07:00
Linus Torvalds 847106ff62 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits)
  security: remove register_security hook
  security: remove dummy module fix
  security: remove dummy module
  security: remove unused sb_get_mnt_opts hook
  LSM/SELinux: show LSM mount options in /proc/mounts
  SELinux: allow fstype unknown to policy to use xattrs if present
  security: fix return of void-valued expressions
  SELinux: use do_each_thread as a proper do/while block
  SELinux: remove unused and shadowed addrlen variable
  SELinux: more user friendly unknown handling printk
  selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine)
  SELinux: drop load_mutex in security_load_policy
  SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av
  SELinux: open code sidtab lock
  SELinux: open code load_mutex
  SELinux: open code policy_rwlock
  selinux: fix endianness bug in network node address handling
  selinux: simplify ioctl checking
  SELinux: enable processes with mac_admin to get the raw inode contexts
  Security: split proc ptrace checking into read vs. attach
  ...
2008-07-14 13:36:55 -07:00
Linus Torvalds dddec01eb8 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
  splice: fix generic_file_splice_read() race with page invalidation
  ramfs: enable splice write
  drivers/block/pktcdvd.c: avoid useless memset
  cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
  scsi: sr avoids useless buffer allocation
  block: blk_rq_map_kern uses the bounce buffers for stack buffers
  block: add blk_queue_update_dma_pad
  DAC960: push down BKL
  pktcdvd: push BKL down into driver
  paride: push ioctl down into driver
  block: use get_unaligned_* helpers
  block: extend queue_flag bitops
  block: request_module(): use format string
  Add bvec_merge_data to handle stacked devices and ->merge_bvec()
  block: integrity flags can't use bit ops on unsigned short
  cmdfilter: extend default read filter
  sg: fix odd style (extra parenthesis) introduced by cmd filter patch
  block: add bounce support to blk_rq_map_user_iov
  cfq-iosched: get rid of enable_idle being unused warning
  allow userspace to modify scsi command filter on per device basis
  ...
2008-07-14 13:15:14 -07:00
Artem Bityutskiy 4ee6afd344 VFS: export sync_sb_inodes
This patch exports the 'sync_sb_inodes()' which is needed for
UBIFS because it has to force write-back from time to time.
Namely, the UBIFS budgeting subsystem forces write-back when
its pessimistic callculations show that there is no free
space on the media.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-07-14 19:10:52 +03:00
Hans Reiser ae8547b0a9 VFS: move inode_lock into sync_sb_inodes
This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather
than expect that the caller will do this.

This change was previously done by Hans Reiser <reiser@namesys.com>
and sat in the -mm tree.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-07-14 19:10:52 +03:00
Ingo Molnar d59fdcf2ac Merge commit 'v2.6.26' into x86/core 2008-07-14 11:37:46 +02:00
Eric Paris 2069f45784 LSM/SELinux: show LSM mount options in /proc/mounts
This patch causes SELinux mount options to show up in /proc/mounts.  As
with other code in the area seq_put errors are ignored.  Other LSM's
will not have their mount options displayed until they fill in their own
security_sb_show_options() function.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: James Morris <jmorris@namei.org>
2008-07-14 15:02:05 +10:00
Stephen Smalley 006ebb40d3 Security: split proc ptrace checking into read vs. attach
Enable security modules to distinguish reading of process state via
proc from full ptrace access by renaming ptrace_may_attach to
ptrace_may_access and adding a mode argument indicating whether only
read access or full attach access is requested.  This allows security
modules to permit access to reading process state without granting
full ptrace access.  The base DAC/capability checking remains unchanged.

Read access to /proc/pid/mem continues to apply a full ptrace attach
check since check_mem_permission() already requires the current task
to already be ptracing the target.  The other ptrace checks within
proc for elements like environ, maps, and fds are changed to pass the
read mode instead of attach.

In the SELinux case, we model such reading of process state as a
reading of a proc file labeled with the target process' label.  This
enables SELinux policy to permit such reading of process state without
permitting control or manipulation of the target process, as there are
a number of cases where programs probe for such information via proc
but do not need to be able to control the target (e.g. procps,
lsof, PolicyKit, ConsoleKit).  At present we have to choose between
allowing full ptrace in policy (more permissive than required/desired)
or breaking functionality (or in some cases just silencing the denials
via dontaudit rules but this can hide genuine attacks).

This version of the patch incorporates comments from Casey Schaufler
(change/replace existing ptrace_may_attach interface, pass access
mode), and Chris Wright (provide greater consistency in the checking).

Note that like their predecessors __ptrace_may_attach and
ptrace_may_attach, the __ptrace_may_access and ptrace_may_access
interfaces use different return value conventions from each other (0
or -errno vs. 1 or 0).  I retained this difference to avoid any
changes to the caller logic but made the difference clearer by
changing the latter interface to return a bool rather than an int and
by adding a comment about it to ptrace.h for any future callers.

Signed-off-by:  Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-07-14 15:01:47 +10:00
Jeff Layton 536abdb080 cifs: fix wksidarr declaration to be big-endian friendly
The current definition of wksidarr works fine on little endian arches
(since cpu_to_le32 is a no-op there), but on big-endian arches, it fails
to compile with this error:

error: braced-group within expression allowed only inside a function

The problem is that this static declaration has cpu_to_le32 embedded
within it, and that expands into a function macro.  We need to use
__constant_cpu_to_le32() instead.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-12 14:33:42 -07:00
Jeff Layton e911d0cc87 cifs: fix inode leak in cifs_get_inode_info_unix
Try this:

    mount a share with unix extensions
    create a file on it
    umount the share

You'll get the following message in the ring buffer:

VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds.  Have a
nice day...

...the problem is that cifs_get_inode_info_unix is creating and hashing
a new inode even when it's going to return error anyway. The first
lookup when creating a file returns an error so we end up leaking this
inode before we do the actual create. This appears to be a regression
caused by commit 0e4bbde94f.

The following patch seems to fix it for me, and fixes a minor
formatting nit as well.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-12 14:33:42 -07:00
Ingo Molnar ae94b8075a Merge branch 'linus' into x86/core
Conflicts:

	arch/x86/mm/ioremap.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-12 07:29:02 +02:00
Eric Sandeen e4079a11f5 ext4: do not set extents feature from the kernel
We've talked for a while about getting rid of any feature-
setting from the kernel; this gets rid of the code which would
set the INCOMPAT_EXTENTS flag on the first file write when mounted
as ext4[dev].

With this patch, if the extents feature is not already set on disk,
then mounting as ext4 will fall back to noextents with a warning,
and if -o extents is explicitly requested, the mount will fail,
also with warning.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V c07651b556 ext4: Don't allow nonextenst mount option for large filesystem
The block mapped inode format can address only blocks within 2**32. This
causes a number of issues, the biggest of which is that the block
allocator needs to be taught that certain inodes can not utilize block
numbers > 2**32.  So until this is fixed, it is simplest to fail
mounting of file systems with more than 2**32 blocks if the -o noextents
option is given.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V dd919b9822 ext4: Enable delalloc by default.
Enable delalloc by default to ensure it gets sufficient testing and
because it makes the filesystem much more efficient.  Add a nodealalloc
option to disable delayed allocation, and update ext4_show_options to
show delayed allocation off if it is disabled.

If the data=journal mount option is used, disable delayed allocation
since the delalloc code doesn't support data=journal yet.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11 19:27:31 -04:00
Mingming Cao 3e3398a08d ext4: delayed allocation i_blocks fix for stat
Right now i_blocks is not getting updated until the blocks are actually
allocaed on disk.  This means with delayed allocation, right after files
are copied, "ls -sF" shoes the file as taking 0 blocks on disk.  "du"
also shows the files taking zero space, which is highly confusing to the
user.

Since delayed allocation already keeps track of per-inode total
number of blocks that are subject to delayed allocation, this patch fix
this by using that to adjust the value returned by stat(2). When real
block allocation is done, the i_blocks will get updated. Since the
reserved blocks for delayed allocation will be decreased, this will be
keep value returned by stat(2) consistent.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao 632eaeab1f ext4: fix delalloc i_disksize early update issue
Ext4_da_write_end() used walk_page_buffers() with a callback function of
ext4_bh_unmapped_or_delay() to check if it extended the file size
without allocating any blocks (since in this case i_disksize needs to be
updated).  However, this is didn't work proprely because the buffer head
has not been marked dirty yet --- this is done later in
block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to
always return false.

In addition, walk_page_buffers() checks all of the buffer heads covering
the page, and the only buffer_head that should be checked is the one
covering the end of the write.  Otherwise, given a 1k blocksize
filesystem and a 4k page size, the buffer head covering the first 1k
stripe of the file could be unmapped (because it was a sparse file), and
the second or third buffer_head covering that page could be mapped, and
using walk_page_buffers() would fail in this case since it would stop at
the first unmapped buffer_head and return true.

The core problem is that walk_page_buffers() was intended to do work in
a callback function, and a non-zero return value indicated a failure,
which termined the walk of the buffer heads covering the page.  It was
not intended to be used with a boolean function, such as
ext4_bh_unmapped_or_delay().

Add addtional fix from Aneesh to protect i_disksize update rave with truncate.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V f0e6c98593 ext4: Handle page without buffers in ext4_*_writepage()
It can happen that buffers are removed from the page before it gets
marked dirty and then is passed to writepage().  In writepage() we just
initialize the buffers and check whether they are mapped and non
delay. If they are mapped and non delay we write the page. Otherwise we
mark them dirty.  With this change we don't do block allocation at all
in ext4_*_write_page.

writepage() can get called under many condition and with a locking order
of journal_start -> lock_page, we should not try to allocate blocks in
writepage() which get called after taking page lock.  writepage() can
get called via shrink_page_list even with a journal handle which was
created for doing inode update.  For example when doing
ext4_da_write_begin we create a journal handle with credit 1 expecting a
i_disksize update for the inode. But ext4_da_write_begin can cause
shrink_page_list via _grab_page_cache. So having a valid handle via
ext4_journal_current_handle is not a guarantee that we can use the
handle for block allocation in writepage, since we shouldn't be using
credits that had been reserved for other updates.  That it could result
in we running out of credits when we update inodes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V cd1aac3292 ext4: Add ordered mode support for delalloc
This provides a new ordered mode implementation which gets rid of using
buffer heads to enforce the ordering between metadata change with the
related data chage.  Instead, in the new ordering mode, it keeps track
of all of the inodes touched by each transaction on a list, and when
that transaction is committed, it flushes all of the dirty pages for
those inodes.  In addition, the new ordered mode reverses the lock
ordering of the page lock and transaction lock, which provides easier
support for delayed allocation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao 61628a3f3a ext4: Invert lock ordering of page_lock and transaction start in delalloc
With the reverse locking, we need to start a transation before taking
the page lock, so in ext4_da_writepages() we need to break the write-out
into chunks, and restart the journal for each chunck to ensure the
write-out fits in a single transaction.

Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
which fixes delalloc sync hang with journal lock inversion, and address
the performance regression issue.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao d2a1763791 ext4: delayed allocation ENOSPC handling
This patch does block reservation for delayed
allocation, to avoid ENOSPC later at page flush time.

Blocks(data and metadata) are reserved at da_write_begin()
time, the freeblocks counter is updated by then, and the number of
reserved blocks is store in per inode counter.
        
At the writepage time, the unused reserved meta blocks are returned
back. At unlink/truncate time, reserved blocks are properly released.

Updated fix from  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix the oldallocator block reservation accounting with delalloc, added
lock to guard the counters and also fix the reservation for meta blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-14 17:52:37 -04:00
Mingming Cao e8ced39d5e percpu_counter: new function percpu_counter_sum_and_set
Delayed allocation need to check free blocks at every write time.
percpu_counter_read_positive() is not quit accurate. delayed
allocation need a more accurate accounting, but using
percpu_counter_sum_positive() is frequently is quite expensive.

This patch added a new function to update center counter when sum
per-cpu counter, to increase the accurate rate for next
percpu_counter_read() and require less calling expensive
percpu_counter_sum().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Alex Tomas 64769240bd ext4: Add delayed allocation support in data=writeback mode
Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and
release the page from page cache if the delalloc write_begin failed, and
properly handle preallocated blocks.  Also added a fix to clear
buffer_delay in block_write_full_page() after allocating a delayed
buffer.

Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to update i_disksize properly and to add bmap support for delayed
allocation.

Updated with a fix from Valerie Clement <valerie.clement@bull.net> to
avoid filesystem corruption when the filesystem is mounted with the
delalloc option and blocksize < pagesize.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by:  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-07-11 19:27:31 -04:00
Alex Tomas 29a814d2ee vfs: add hooks for ext4's delayed allocation support
Export mpage_bio_submit() and __mpage_writepage() for the benefit of
ext4's delayed allocation support.   Also change __block_write_full_page
so that if buffers that have the BH_Delay flag set it will call
get_block() to get the physical block allocated, just as in the
!BH_Mapped case.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara 87c89c232c jbd2: Remove data=ordered mode support using jbd buffer heads
Signed-off-by: Jan Kara <jack@suse.cz>
2008-07-11 19:27:31 -04:00
Jan Kara 678aaf4814 ext4: Use new framework for data=ordered mode in JBD2
This patch makes ext4 use inode-based implementation of data=ordered mode
in JBD2. It allows us to unify some data=ordered and data=writeback paths
(especially writepage since we don't have to start a transaction anymore)
and remove some buffer walking.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix file system hang due to corrupt jinode values.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara c851ed5401 jbd2: Implement data=ordered mode handling via inodes
This patch adds necessary framework into JBD2 to be able to track inodes
with each transaction and write-out their dirty data during transaction
commit time.

This new ordered mode brings all sorts of advantages such as possibility 
to get rid of journal heads and buffer heads for data buffers in ordered 
mode, better ordering of writes on transaction commit, simplification of
 some JBD code, no more anonymous pages when truncate of data being 
committed happens.  Also with this new ordered mode, delayed allocation 
on ordered mode is much simpler.

Signed-off-by: Jan Kara <jack@suse.cz>
2008-07-11 19:27:31 -04:00
Jan Kara 9ddfc3dc75 ext4: Fix lock inversion in ext4_ext_truncate()
We cannot call ext4_orphan_add() from under i_data_sem because that
causes a lock ordering violation between i_data_sem and and the
superblock lock.

Updated with Aneesh's locking order fix

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com> 
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara cf108bca46 ext4: Invert the locking order of page_lock and transaction start
This changes are needed to support data=ordered mode handling via
inodes.  This enables us to get rid of the journal heads and buffer
heads for data buffers in the ordered mode.  With the changes, during
tranasaction commit we writeout the inode pages using the
writepages()/writepage(). That implies we take page lock during
transaction commit. This can cause a deadlock with the locking order
page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait
for the journal_commit to happen and the journal_commit now needs to
take the page lock. To avoid this dead lock reverse the locking order.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara c7d206b337 vfs: Move mark_inode_dirty() from under page lock in generic_write_end()
There's no need to call mark_inode_dirty() under page lock in
generic_write_end(). It unnecessarily makes hold time of page lock longer
and more importantly it forces locking order of page lock and transaction
start for journaling filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V 2e9ee85035 ext4: Use page_mkwrite vma_operations to get mmap write notification.
We would like to get notified when we are doing a write on mmap section.
This is needed with respect to preallocated area. We split the preallocated
area into initialzed extent and uninitialzed extent in the call back. This
let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
that would result in data loss. The changes are also needed to handle ENOSPC
when writing to an mmap section of files with holes.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Frederic Bohe 5f21b0e642 ext4: fix online resize with mballoc
Update group infos when updating a group's descriptor.
Add group infos when adding a group's descriptor.
Refresh cache pages used by mb_alloc when changes occur.
This will probably need modifications when META_BG resizing will be allowed.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11 19:27:31 -04:00
Eric Sandeen 953e622b60 ext4: use atomic functions to set bh_state
Use the BUFFER_FNS functions (set_buffer_foo) to set buffer
head state atomically instead of nonatomic __set_bit().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara 47b4a50beb ext4: Set journal pointer to NULL when journal is released
Set sbi->s_journal to NULL after we call journal_destroy(). This
will be later needed because after journal_destroy() is called,
ext4_clear_inode() can still be called for some inodes (e.g. root
inode) and we'll need to detect there that journal doesn't exists
anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao 0703143107 ext4: mballoc avoid use root reserved blocks for non root allocation
mballoc allocation missed check for blocks reserved for root users. Add
ext4_has_free_blocks() check before allocation. Also modified
ext4_has_free_blocks() to support multiple block allocation request.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Eric Sandeen d755fb3842 ext4: call blkdev_issue_flush on fsync
To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.

Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V 654b4908bc ext4: cleanup block allocator
Move the code related to block allocation to a single function and add helper
funtions to differient allocation for data and meta data blocks

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V 7061eba75c ext4: Use inode preallocation with -o noextents
When mballoc is enabled, block allocation for old block-based
files are allocated using mballoc allocator instead of old
block-based allocator. The old ext3 block reservation is turned
off when mballoc is turned on.

However, the in-core preallocation is not enabled for block-based/
non-extent based file block allocation. This result in performance
regression, as now we don't have "reservation" ore in-core preallocation
to prevent interleaved fragmentation in multiple writes workload.

This patch fix this by enable per inode in-core preallocation
for non extent files when mballoc is used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Akinobu Mita 6afd670713 ext4: fix ext4_init_block_bitmap() for metablock block group
When meta_bg feature is enabled and s_first_meta_bg != 0, 
ext4_init_block_bitmap() miscalculates the number of block used by
the group descriptor table (0 or 1 for metablock block group)

This patch fixes this by using ext4_bg_num_gdb()

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Andreas Dilger <adilger@sun.com>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V 7477827f66 ext4: Fix sparse warning
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Li Zefan d9c769b769 ext4: cleanup never-used magic numbers from htree code
dx_root_limit() will had some dead code which forced it to always return
20, and dx_node_limit to always return 22 for debugging purposes.
Remove it.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 9102e4fa80 ext4: Fix ext4_ext_journal_restart() to reflect errors up to the caller
Fix ext4_ext_journal_restart() so it returns any errors reported by 
ext4_journal_extend() and ext4_journal_restart().

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 1973adcba5 ext4: Make ext4_ext_find_extent fills ext_path completely
When pos=0 or depth, the fields of ext4_ext_path is are not
completely filled.  This patch also removes some unnecessary code.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 787e0981fa ext4: return error when calling ext4_ext_split failed
ext4_ext_create_new_leaf must return error when its
calling to ext4_ext_split failed.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V a379cd1d6b ext4: Update i_disksize properly when allocating from fallocate area.
When allocating unitialized space at the end of file which had been
preallocated with the FALLOC_FL_KEEP_SIZE option, the file size is not
updated at that time.  But the later we are not updating the file size
when writing to that preallocated space.

These changes are for code correctness.  This patch allows us to update
the i_disksize at the write_end() callback of filesystem properly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 363d4251d4 ext4: remove quota allocation when ext4_mb_new_blocks fails
Quota allocation is not removed when ext4_mb_new_blocks calls
kmem_cache_alloc failed.  Also make sure the allocation context is freed
on the error path.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Li Zefan f9a8ac99fd ext4: remove redundant code in ext4_fill_super()
The previous sb_min_blocksize() has already set the block size.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao 530576bbf3 jbd2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction
journal_try_to_free_buffers() could race with jbd commit transaction
when the later is holding the buffer reference while waiting for the
data buffer to flush to disk. If the caller of
journal_try_to_free_buffers() request tries hard to release the buffers,
it will treat the failure as error and return back to the caller. We
have seen the directo IO failed due to this race.  Some of the caller of
releasepage() also expecting the buffer to be dropped when passed with
GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating
this call could wait, in case of try_to_free_buffers() failed, let's
waiting for journal_commit_transaction() to finish commit the current
committing transaction , then try to free those buffers again with
journal locked.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-13 21:06:39 -04:00
Jose R. Santos 772cb7c83b ext4: New inode allocation for FLEX_BG meta-data groups.
This patch mostly controls the way inode are allocated in order to
make ialloc aware of flex_bg block group grouping.  It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
toghether through mke2fs.  Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new block begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward.  This helps improve
performance by reducing seek time.  Since the group of inode tables
within one flex_bg are treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
table as with Orlov which would help fsck time as the filesystem usage
goes up.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o 736603ab29 jbd2: Add commit time into the commit block
Carlo Wood has demonstrated that it's possible to recover deleted
files from the journal.  Something that will make this easier is if we
can put the time of the commit into commit block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Stoyan Gaydarov 4db9c54a53 ext4: replace __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__ instead

Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-13 21:03:29 -04:00
Shen Feng 7e5a8cdd84 ext4: fix error processing in mb_free_blocks
The error processing of the return value of mb_free_blocks is meanless
because it only returns 0.  This fix includes

- make mb_free_blocks return void

- remove the error processing part in callers

- unlock group before calling ext4_error in mb_free_blocks

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-13 21:03:31 -04:00
Shen Feng cfbe7e4f5e ext4: error proc entry creation when the fs/ext4 is not correctly created
When the directory fs/ext4 is not correctly created under proc, the entry
under this directory should not be created.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-13 21:03:31 -04:00
Li Zefan f795e14073 ext4: fix build failure if DX_DEBUG is enabled
ext4_next_entry() is used by the debugging function dx_show_leaf(), so
it must be defined before that function.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o 7ad72ca60b ext4: Remove unused variable from ext4_show_options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o 574ca174c9 ext4: Rename read_block_bitmap() to ext4_read_block_bitmap()
Since this a non-static function, make it be ext4 specific to avoid
conflicts with potentially other filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 3537576a70 ext4: remove double definitions of xattr macros
remove the definitions of macros XATTR_TRUSTED_PREFIX and XATTR_USER_PREFIX
since they are defined in linux/xattr.h

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 74767c5a2d ext4: miscellaneous error checks and coding cleanups for mballoc
ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL

ext4_mb_history_init(): replace kmalloc and memset with kzalloc

ext4_mb_init_backend(): remove memset since kzalloc is used

ext4_mb_init(): the return value of ext4_mb_init_backend is int,
	but i is unsigned, replace it with a new int variable.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng fdf6c7a768 ext4: add error processing when calling ext4_mb_init_cache in mballoc
Add error processing for ext4_mb_load_buddy when it calls
ext4_mb_init_cache.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by:   Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao 31b481dc7c ext4: Fix ext4_mb_init_cache return error
ext4_mb_init_cache() incorrectly always return EIO on success. This
causes the caller of ext4_mb_init_cache() fail when it checks the return
value.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 69baee062a ext4: improve some code in rb tree part of dir.c
* remove unnecessary code in free_rb_tree_fname

* rename free_rb_tree_fname to ext4_htree_create_dir_info
  since it and ext4_htree_free_dir_info are a pair

* replace kmalloc with kzalloc in ext4_htree_free_dir_info

All these make the code more readable and simple.
PS: this patch is also suitable for ext3.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Alexey Dobriyan 91d9982779 ext4: switch to seq_files
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-07-11 19:27:31 -04:00
Julia Lawall 07d45f1267 ext4: Use BUG_ON() instead of BUG()
if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@ disable unlikely @ expression E,f; @@

(
  if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)

@@ expression E,f; @@

(
  if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V ed8f9c751f ext4: start searching for the right extent from the goal group.
With mballoc we search for the best extent using different
criteria. We should always use the goal group when we are
starting with a new criteria.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng 8a35694e11 ext4: fix comments to say "ext4"
Change second/third to fourth.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V e7dfb2463e ext4: Fix mb_find_next_bit not to return larger than max
Some architectures implement ext4_find_next_bit and
ext4_find_next_zero_bit in such a way that they return
greater than max for some input values. Make sure
mb_find_next_bit and mb_find_next_zero_bit return the
right values.

On 2.6.25 we have include/asm-x86/bitops_32.h
static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
{
	unsigned x = 0;

	while (x < size) {
		unsigned long val = *addr++;
		if (val)
			return __ffs(val) + x;
		x += (sizeof(*addr)<<3);
	}
	return x;
}

This can return value greater than size.

Reported and fixed here for lustre

https://bugzilla.lustre.org/show_bug.cgi?id=15932
https://bugzilla.lustre.org/attachment.cgi?id=17205

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Duane Griffin f3b35f063e ext4: validate directory entry data before use
ext4_dx_find_entry uses ext4_next_entry without verifying that the entry is
valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop
to check the validity of entries before checking whether they match and
moving onto the next one.

There are other uses of ext4_next_entry in this file which also look
problematic. They should be reviewed and fixed if/when we have a test-case
that triggers them.

This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Duane Griffin 71dc8fbcf5 ext4: handle deleting corrupted indirect blocks
While freeing indirect blocks we attach a journal head to the parent buffer
head, free the blocks, then journal the parent. If the indirect block list
is corrupted and points to the parent the journal head will be detached
when the block is cleared, causing an OOPS.

Check for that explicitly and handle it gracefully.

This patch fixes the third case (image hdb.20000057.nullderef.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Duane Griffin 91ef4caf80 ext4: handle corrupted orphan list at mount
If the orphan node list includes valid, untruncatable nodes with nlink > 0
the ext4_orphan_cleanup loop which attempts to delete them will not do so,
causing it to loop forever. Fix by checking for such nodes in the
ext4_orphan_get function.

This patch fixes the second case (image hdb.20000009.softlockup.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Dave Chinner 49641f1acf Fix reference counting race on log buffers
When we release the iclog, we do an atomic_dec_and_lock to determine if
we are the last reference and need to trigger update of log headers and
writeout.  However, in xlog_state_get_iclog_space() we also need to
check if we have the last reference count there.  If we do, we release
the log buffer, otherwise we decrement the reference count.

But the compare and decrement in xlog_state_get_iclog_space() is not
atomic, so both places can see a reference count of 2 and neither will
release the iclog.  That leads to a filesystem hang.

Close the race by replacing the atomic_read() and atomic_dec() pair with
atomic_add_unless() to ensure that they are executed atomically.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Tim Shimmin <tes@sgi.com>
Tested-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-11 11:37:18 -07:00
David Howells 4abaca17e7 [GFS2] Fix GFS2's use of do_div() in its quota calculations
Fix GFS2's need_sync()'s use of do_div() on an s64 by using div_s64() instead.

This does assume that gt_quota_scale_den can be cast to an s32.

This was introduced by patch b3b94faa5f.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-11 14:35:01 +01:00
Hugh Dickins 96a8e13ed4 exec: fix stack excutability without PT_GNU_STACK
Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
exec'ing an ELF without a PT_GNU_STACK program header should default to an
executable stack; but this got broken by the unlimited argv feature because
stack vma is now created before the right personality has been established:
so breaking old binaries using nested function trampolines.

Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
vm_flags used to be set, before the mprotect_fixup.  Checking through
our existing VM_flags, none would have changed since insert_vm_struct:
so this seems safer than finding a way through the personality labyrinth.

Reported-by: pageexec@freemail.hu
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-10 13:25:43 -07:00
Mark Fasheh e988cf1cfe ocfs2: Fix flags in ocfs2_file_lock
The stack-glue merge changed the way we use flags in dlmglue in that we now
use the fs/dlm equivalents. Unfortunately, a merge error left the new flock
code only partially updated. This took a while to show up though, because
the lock level constants are actually identical between o2dlm and fs/dlm.
The *_CONVERT and *_NOQUEUE flags have different values though, which is
eventually causing a crash in flags_to_o2dlm().

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-10 09:25:39 -07:00
Li Xiaodong a93a6ce242 [GFS2] Remove unused declaration
The implementation of gfs2_inode_attr_in is removed.
So remove its declaration.

Signed-off-by: Li Xiaodong <lixd@cn.fujitsu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-10 16:22:23 +01:00
Steven Whitehouse c9f6a6bbc2 [GFS2] Remove support for unused and pointless flag
The ability to mark files for direct i/o access when opened
normally is both unused and pointless, so this patch removes
support for that feature.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-10 16:09:29 +01:00
Steven Whitehouse 9cabcdbd46 [GFS2] Replace rgrp "recent list" with mru list
This patch removes the "recent list" which is used during allocation
and replaces it with the (already existing) mru list used during
deletion. The "recent list" was not a true mru list leading to a number
of inefficiencies including a "next" function which made scanning the
list an order N^2 operation wrt to the number of list elements.

This should increase allocation performance with large numbers of rgrps.
Its also a useful preparation and cleanup before some further changes
which are planned in this area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-10 15:54:12 +01:00
Chuck Lever f45663ce5f NFS: Allow either strict or sloppy mount option parsing
The kernel's NFS client mount option parser currently doesn't allow
unrecognized or incorrect mount options.  This prevents misspellings or
incorrectly specified mount options from possibly causing silent data
corruption.

However, NFS mount options are not standardized, so different operating
systems can use differently spelled mount options to support similar
features, or can support mount options which no other operating system
supports.

"Sloppy" mount option parsing, which allows the parser to ignore any
option it doesn't recognize, is needed to support automounters that often
use maps that are shared between heterogenous operating systems.

The legacy mount command ignores the validity of the values of mount
options entirely, except for the "sec=" and "proto=" options.  If an
incorrect value is specified, the out-of-range value is passed to the
kernel; if a value is specified that contains non-numeric characters,
it appears as though the legacy mount command sets that option to zero
(probably incorrect behavior in general).

In any case, this sets a precedent which we will partially follow for
the kernel mount option parser:

	+ if "sloppy" is not set, the parser will be strict about both
	  unrecognized options (same as legacy) and invalid option
	  values (stricter than legacy)

	+ if "sloppy" is set, the parser will ignore unrecognized
	  options and invalid option values (same as legacy)

An "invalid" option value in this case means that either the type
(integer, short, or string) or sign (for integer values) of the specified
value is incorrect.

This patch does two things: it changes the NFS client's mount option
parsing loop so that it parses the whole string instead of failing at
the first unrecognized option or invalid option value.  An unrecognized
option or an invalid option value cause the option to be skipped.

Then, the patch adds a "sloppy" mount option that allows the parsing
to succeed anyway if there were any problems during parsing.  When
parsing a set of options is complete, if there are errors and "sloppy"
was specified, return success anyway.  Otherwise, only return success
if there are no errors.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:44 -04:00
Chuck Lever 6738b2512b NFS4: Set security flavor default for NFSv4 mounts like other defaults
Set the default security flavor when we set the other mount option
default values for NFSv4.  This cleans up the NFSv4 mount option parsing
path to look like the NFSv2/v3 one.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:43 -04:00
Chuck Lever dd07c94750 NFS: Set security flavor default for NFSv2/3 mounts like other defaults
Set the default security flavor when we set the other mount option default
values.  After this change, only the legacy user-space mount path needs to
set the NFS_MOUNT_SECFLAVOUR flag.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:42 -04:00
Chuck Lever 01060c896e NFS: Refactor logic for parsing NFS security flavor mount options
Clean up: Refactor the NFS mount option parsing function to extract the
security flavor parsing logic into a separate function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:41 -04:00
Chuck Lever 0e0cab744b NFS: use documenting macro constants for initializing ac{reg, dir}{min, max}
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:40 -04:00
Chuck Lever ed596a8adb NFS: Move the nfs_set_port() call out of nfs_parse_mount_options()
The remount path does not need to set the port in the server address.
Since it's not really a part of option parsing, move the nfs_set_port()
call to nfs_parse_mount_options()'s callers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:39 -04:00
Trond Myklebust 259875efed NFS: set transport defaults after mount option parsing is finished
Move the UDP/TCP default timeo/retrans settings for text mounts to
nfs_init_timeout_values(), which was were they were always being
initialised (and sanity checked) for binary mounts.
Document the default timeout values using appropriate #defines.

Ensure that we initialise and sanity check the transport protocols that
may have been specified by the user.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:38 -04:00
Chuck Lever 40fef8a649 SUNRPC: Use only rpcbind v2 for AF_INET requests
Some server vendors support the higher versions of rpcbind only for
AF_INET6.  The kernel doesn't need to use v3 or v4 for AF_INET anyway,
so change the kernel's rpcbind client to query AF_INET servers over
rpcbind v2 only.

This has a few interesting benefits:

1. If the rpcbind request is going over TCP, and the server doesn't
   support rpcbind versions 3 or 4, the client reduces by two the number
   of ephemeral ports left in TIME_WAIT for each rpcbind request.  This
   will help during NFS mount storms.

2. The rpcbind interaction with servers that don't support rpcbind
   versions 3 or 4 will use less network traffic.  Also helpful
   during mount storms.

3. We can eliminate the kernel build option that controls whether the
   kernel's rpcbind client uses rpcbind version 3 and 4 for AF_INET
   servers.  Less complicated kernel configuration...

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:37 -04:00
Jeff Layton 5afc597c5f nfs4: fix potential race with rapid nfs_callback_up/down cycle
If the nfsv4 callback thread is rapidly brought up and down, it's
possible that nfs_callback_svc might never get a chance to run. If
this happens, the cleanup at thread exit might never occur, throwing
the refcounting off and nfs_callback_info in an incorrect state.

Move the clean functions into nfs_callback_down. Also change the
nfs_callback_info struct to track the svc_rqst rather than svc_serv
since we need to know that to call svc_exit_thread.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:32 -04:00
Jeff Layton ee84dfc454 nfs4: remove BKL from nfs_callback_up and nfs_callback_down
The nfs_callback_mutex is sufficient protection.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:31 -04:00
Benny Halevy 77e03677ac nfs: initialize timeout variable in nfs4_proc_setclientid_confirm
gcc (4.3.0) rightfully warns about this:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c: In function ‘nfs4_proc_setclientid_confirm’:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c:2936: warning: ‘timeout’ may be used uninitialized in this function

nfs4_delay that's passed a pointer to 'timeout' is looking at its value
and sets it up to some value in the range: NFS4_POLL_RETRY_MIN..NFS4_POLL_RETRY_MAX
	if (*timeout <= 0)
		*timeout = NFS4_POLL_RETRY_MIN;
	if (*timeout > NFS4_POLL_RETRY_MAX)
		*timeout = NFS4_POLL_RETRY_MAX;

Therefore it will end up set to some sane, though rather indeterministic, value.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:30 -04:00
Chuck Lever d8e7748ab8 NFS: handle interface identifiers in incoming IPv6 addresses
Add support in the kernel NFS client's address parser for interface
identifiers.

IPv6 link-local addresses require an additional "interface identifier",
which is a network device name or an integer that indexes the array of
local network interfaces.  They are suffixed to the address with a '%'.
For example:

	fe80::215:c5ff:fe3b:e1b2%2

indicates an interface index of 2.  Or

	fe80::215:c5ff:fe3b:e1b2%eth0

indicates that requests should be routed through the eth0 device.
Without the interface ID, link-local addresses are not usable for NFS.

Both the kernel NFS client mount option parser and the mount.nfs command
can take either form.  The mount.nfs command always passes the address
through getnameinfo(3), which usually re-writes interface indices as
device names.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:29 -04:00
Chuck Lever ce3b7e1906 NFS: Add string length argument to nfs_parse_server_address
To make nfs_parse_server_address() more generally useful, allow it to
accept input strings that are not terminated with '\0'.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:28 -04:00
Chuck Lever d1aa082573 NFS: Support raw IPv6 address hostnames during NFS mount operation
Traditionally the mount command has looked for a ":" to separate the
server's hostname from the export path in the mounted on device name,
like this:

	mount server:/export /mounted/on/dir

The server's hostname is "server" and the export path is "/export".

You can also substitute a specific IPv4 network address for the server
hostname, like this:

	mount 192.168.0.55:/export /mounted/on/dir

Raw IPv6 addresses present a problem, however, because they look
something like this:

	fe80::200:5aff:fe00:30b

Note the use of colons.

To get around the presence of colons, copy the Solaris convention used for
mounting IPv6 servers by address: wrap a raw IPv6 address with square
brackets.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:27 -04:00
Chuck Lever dc04589827 NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3
To support passing a raw IPv6 address as a server hostname, we need to
expand the logic that handles splitting the passed-in device name into
a server hostname and export path

Start by pulling device name parsing out of the mount option validation
functions and into separate helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:26 -04:00
Trond Myklebust cd10072562 NFS: Fix a dependency on CONFIG_NFS_V4 in nfs_remount
Fix the 'nfs4_fs_type' undeclared error in nfs_remount when compiling sans
NFSv4...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jeff Layton <jlayton@redhat.com>
2008-07-09 12:09:25 -04:00
Trond Myklebust e468bae97d NFS: Allow redirtying of a completed unstable write.
Currently, if an unstable write completes, we cannot redirty the page in
order to reflect a new change in the page data until after we've sent a
COMMIT request.

This patch allows a page rewrite to proceed without the unnecessary COMMIT
step, putting it immediately back onto the dirty page list, undoing the
VM unstable write accounting, and removing the NFS_PAGE_TAG_COMMIT tag from
the NFS radix tree.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:24 -04:00
Trond Myklebust e7d39069e3 NFS: Clean up nfs_update_request()
Simplify the loop in nfs_update_request by moving into a separate function
the code that attempts to update an existing cached NFS write.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:23 -04:00
Chuck Lever 396cee977f NFS: missing newline in NFS mount debugging message
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:22 -04:00
Chuck Lever d33e4dfeab NFS: Treat "intr" and "nointr" options as deprecated
Clean up:  the "intr" and "nointr" mount options were recently retired.
Document this in the NFS mount option parser.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:21 -04:00
Chuck Lever ecbb3845dd NFS: Allow any value for the "retry" option
The kernel NFS mount option parser should ignore the retry= mount option
since it is meaningful only in user space.  Today it expects a number
rather than arbitrary text, so it ignores the option if the value is
numeric, but chokes if there are other characters in the value.

Change it to allow any text (except ",") as its value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:20 -04:00
Trond Myklebust f41f741838 NFS: Ensure we zap only the access and acl caches when setting new acls
...and ensure that we obey the NFS_INO_INVALID_ACL flag when retrieving the
acls.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:19 -04:00
Trond Myklebust 2e96d28672 NFS: Fix a warning in nfs4_async_handle_error
We're not modifying the nfs_server when we call nfs_inc_server_stats and
friends, so allow the compiler to pass 'const' pointers too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:18 -04:00
Chuck Lever 34e8f92831 NFS: Move fs/nfs/iostat.h to include/linux
The fs/nfs/iostat.h header has definitions that were designed to be exposed
to user space.  Move these definitions under include/linux so user space can
use the definitions in applications that read /proc/self/mountstats.

Also address a handful of coding style issues called out by checkpatch.pl in
fs/nfs/iostat.h.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:17 -04:00
Trond Myklebust 46cb650c22 NFS: Remove the redundant file_open entry from struct nfs_rpc_ops
All instances are set to nfs_open(), so we should just remove the redundant
indirection. Ditto for the file_release op

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:16 -04:00