Commit Graph

572 Commits

Author SHA1 Message Date
Linus Torvalds a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Jan Kara 8e8ad8a57c ext4: Convert to new freezing mechanism
We remove most of frozen checks since upper layer takes care of blocking all
writes. We have to handle protection in ext4_page_mkwrite() in a special way
because we cannot use generic block_page_mkwrite(). Also we add a freeze
protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify
a frozen filesystem (we cannot easily instrument ext4_journal_start() /
ext4_journal_stop() with freeze protection because we are missing the
superblock pointer in ext4_journal_stop() in nojournal mode).

CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 09:45:48 +04:00
Linus Torvalds 173f865474 The usual collection of bug fixes and optimizations. Perhaps of
greatest note is a speed up for parallel, non-allocating DIO writes,
 since we no longer take the i_mutex lock in that case.  For bug fixes,
 we fix an incorrect overhead calculation which caused slightly
 incorrect results for df(1) and statfs(2).  We also fixed bugs in the
 metadata checksum feature.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQE1SzAAoJENNvdpvBGATwzOMQAKxVkaTqkMc+cUahLLUkFdGQ
 xnmigHufVuOvv+B8l1p6vbYx+qOftztqXF1WC41j8mTfEDs19jKXv3LI57TJ+bIW
 a/Kp1CjMicRs/HGhFPNWp1D+N8J95MTFP6Y9biXSmBBvefg2NSwxpg48yZtjUy1/
 Zl0414AqMvTJyqKKOUa++oyl3XlawnzDZ+6a7QPKsrAaDOU5977W4y2tZkNFk84d
 qRjTfaiX13aVe7XupgQHe/jtk40Sj38pyGVGiAGlHOZhZtYKE6MzB8OreGiMTy8z
 rmJz/CQUsQ8+sbKYhAcDru+bMrKghbO0u78XRz9CY+YpVFF/Xs2QiXoV0ZOGkIm6
 eokq7jEV0K+LtDEmM3PkmUPgXfYw5damTv8qWuBUFd4wtVE5x/DmK8AJVMidCAUN
 GkVR+rEbbEi7RCwsuac/aKB8baVQCTiJ5tfNTgWh9zll+9GZSk+U71Pp0KdcJGiS
 nxitAZ+20hZN2CQctlmaGbCPTPYCWU4hQ3IuMdTlQTQAs8S0y1FtylTRsXcC1eVR
 i1hBS/dVw5PVCaqoX79zYrByUymgX0ZaYY6seRT6+U9xPGDCSNJ9mjXZL3fttnOC
 d5gsx/pbMIAv52G5Hj6DfsXR2JFmmxsaIzsLtRvKi9q89d84XaZfbUsHYjn4Neym
 5lTKaSQHU71cKCxrStHC
 =VAVB
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The usual collection of bug fixes and optimizations.  Perhaps of
  greatest note is a speed up for parallel, non-allocating DIO writes,
  since we no longer take the i_mutex lock in that case.

  For bug fixes, we fix an incorrect overhead calculation which caused
  slightly incorrect results for df(1) and statfs(2).  We also fixed
  bugs in the metadata checksum feature."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: undo ext4_calc_metadata_amount if we fail to claim space
  ext4: don't let i_reserved_meta_blocks go negative
  ext4: fix hole punch failure when depth is greater than 0
  ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()
  ext4: weed out ext4_write_super
  ext4: remove unnecessary superblock dirtying
  ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()
  ext4: remove useless marking of superblock dirty
  ext4: fix ext4 mismerge back in January
  ext4: remove dynamic array size in ext4_chksum()
  ext4: remove unused variable in ext4_update_super()
  ext4: make quota as first class supported feature
  ext4: don't take the i_mutex lock when doing DIO overwrites
  ext4: add a new nolock flag in ext4_map_blocks
  ext4: split ext4_file_write into buffered IO and direct IO
  ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()
  ext4: fix out-of-date comments in extents.c
  ext4: use s_csum_seed instead of i_csum_seed for xattr block
  ext4: use proper csum calculation in ext4_rename
  ext4: fix overhead calculation used by ext4_statfs()
  ...
2012-07-27 20:52:25 -07:00
Artem Bityutskiy 4d47603d97 ext4: weed out ext4_write_super
We do not depend on VFS's '->write_super()' anymore and do not need
the 's_dirt' flag anymore, so weed out 'ext4_write_super()' and
's_dirt'.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:35:31 -04:00
Artem Bityutskiy 58c5873a76 ext4: remove unnecessary superblock dirtying
This patch changes the 'ext4_handle_dirty_super()' function which
submits the superblock for I/O in the following cases:

1. When creating the first large file on a file system without
   EXT4_FEATURE_RO_COMPAT_LARGE_FILE feature.
2. When re-sizing the file-system.
3. When creating an xattr on a file-system without the
   EXT4_FEATURE_COMPAT_EXT_ATTR feature.

If the file-system has journal enabled, the superblock is written via
the journal. We do not modify this path.

If the file-system has no journal, this function, falls back to just
marking the superblock as dirty using the 's_dirt' superblock
flag. This means that it delays the actual superblock I/O submission
by 5 seconds (default setting).  Namely, the 'sync_supers()' kernel
thread will call 'ext4_write_super()' later and will actually submit
the superblock for I/O.

And this is the behavior this patch modifies: we stop using 's_dirt'
and just mark the superblock buffer as dirty right away. Indeed, all 3
cases above are extremely rare and it does not add any value to delay
the I/O submission for them.

Note: 'ext4_handle_dirty_super()' executes
'__ext4_handle_dirty_super()' with 'now = 0'. This patch basically
makes the 'now' argument unneeded and it will be deleted in one of the
next patches.

This patch also removes 's_dirt' condition on the unmount path because
we never set it anymore, so we should not test it.

Tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:33:31 -04:00
Aditya Kali 7c319d3285 ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.

It is based on the proposal at:                                                                                                           
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4

This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
iteself. Also, the quota inodes are stored in two additional superblock
fields.  Some changes introduced by this patch that should be pointed
out are:

1) Two new ext4-superblock fields - s_usr_quota_inum and
   s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
   for tracking group quota. The superblock fields can be set to use
   other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
   superblock, the quota usage tracking is turned on at mount time. On
   'quotaon' ioctl, the quota limits enforcement is turned
   on. 'quotaoff' ioctl turns off only the limits enforcement in this
   case.
4) When QUOTA feature is in use, the quota mount options 'quota',
   'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
   quota inodes. The default reserved inodes will not be visible to user
   as regular files.
6) The quota-tools will need to be modified to support hidden quota
   files on ext4. E2fsprogs will also include support for creating and
   fixing quota files.
7) Support is only for the new V2 quota file format.

Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:21:31 -04:00
Jan Kara a117782571 quota: Move quota syncing to ->sync_fs method
Since the moment writes to quota files are using block device page cache and
space for quota structures is reserved at the moment they are first accessed we
have no reason to sync quota before inode writeback. In fact this order is now
only harmful since quota information can easily change during inode writeback
(either because conversion of delayed-allocated extents or simply because of
allocation of new blocks for simple filesystems not using page_mkwrite).

So move syncing of quota information after writeback of inodes into ->sync_fs
method. This way we do not have to use ->quota_sync callback which is primarily
intended for use by quotactl syscall anyway and we get rid of calling
->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
also support legacy non-journalled quotas) and thus there are no dirty quota
structures.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Joel Becker <jlbec@evilplan.org>
CC: reiserfs-devel@vger.kernel.org
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:34 +04:00
Theodore Ts'o 952fc18ef9 ext4: fix overhead calculation used by ext4_statfs()
Commit f975d6bcc7 introduced bug which caused ext4_statfs() to
miscalculate the number of file system overhead blocks.  This causes
the f_blocks field in the statfs structure to be larger than it should
be.  This would in turn cause the "df" output to show the number of
data blocks in the file system and the number of data blocks used to
be larger than they should be.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-07-09 16:27:05 -04:00
Linus Torvalds 4edebed866 Ext4 updates for 3.5
The major new feature added in this update is Darrick J. Wong's
 metadata checksum feature, which adds crc32 checksums to ext4's
 metadata fields.  There is also the usual set of cleanups and bug
 fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABCAAGBQJPyNleAAoJENNvdpvBGATwtLMP/i3WsPyTvxmYP6HttHXQb8Jk
 GYCoTQ5bZMuTbOwOGg3w137cXWBv5uuPpxIk79YVLHSWx6HuanlGIa7/VnPKIaLu
 2ihuvVfnrDqpwQ4MJaSq4R1Eka9JCwZ7HbYYo+fYOVobxgw588JVV9VVI9EdKRGz
 z11UkW8iHE0f6Xa5gOhdAMkR0uaPnxwJX/qHZYiHuognRivuwMglqWJSiMr8nQmo
 A2GmeoLehhW+k65IqgTCmSW6ZgFTvZdk6bskQIij3fOYHW3hHn/gcLFtmLTIZ/B5
 LZdg/lngPYve+R/UyypliGKi+pv1qNEiTiBm0rrBgsdZFkBdGj0soSvGZzeK+Mp4
 Q1vAmOBPYPFzs6nVzPst2n/osryyykFCK6TgSGZ50dosJ0NO8cBeDdX/gh9JKD2R
 yQUMUltOCCSj/eWU4iwqZ0T3FXRiH/+S3XMHznoKJiwUyGDBNQy4+Yg2k2WzUXrz
 Cu5t5BwNG2WNP7y5Et/wmUIzpC7VPId4qYmGyHe7OwTxSJgW+6f7GVkHfjWcDMuv
 pGgEUiInbMmLajP3v2/LKfVU4hXLZy4uJbhoBgDdeIpZrnPifJG/MwDOS4W+dLVT
 tDzgO1SAh3/E4jATreZ5bjzD/HGsfe1OX09UH3Pbc1EcgkrLnyrQXFwdHshdVu4A
 cxMoKNPVCQJySb1UrLkO
 =SdJJ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull Ext4 updates from Theodore Ts'o:
 "The major new feature added in this update is Darrick J Wong's
  metadata checksum feature, which adds crc32 checksums to ext4's
  metadata fields.

  There is also the usual set of cleanups and bug fixes."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: hole-punch use truncate_pagecache_range
  jbd2: use kmem_cache_zalloc wrapper instead of flag
  ext4: remove mb_groups before tearing down the buddy_cache
  ext4: add ext4_mb_unload_buddy in the error path
  ext4: don't trash state flags in EXT4_IOC_SETFLAGS
  ext4: let getattr report the right blocks in delalloc+bigalloc
  ext4: add missing save_error_info() to ext4_error()
  ext4: add debugging trigger for ext4_error()
  ext4: protect group inode free counting with group lock
  ext4: use consistent ssize_t type in ext4_file_write()
  ext4: fix format flag in ext4_ext_binsearch_idx()
  ext4: cleanup in ext4_discard_allocated_blocks()
  ext4: return ENOMEM when mounts fail due to lack of memory
  ext4: remove redundundant "(char *) bh->b_data" casts
  ext4: disallow hard-linked directory in ext4_lookup
  ext4: fix potential integer overflow in alloc_flex_gd()
  ext4: remove needs_recovery in ext4_mb_init()
  ext4: force ro mount if ext4_setup_super() fails
  ext4: fix potential NULL dereference in ext4_free_inodes_counts()
  ext4/jbd2: add metadata checksumming to the list of supported features
  ...
2012-06-01 10:12:15 -07:00
Theodore Ts'o f3fc0210c0 ext4: add missing save_error_info() to ext4_error()
The ext4_error() function is missing a call to save_error_info().
Since this is the function which marks the file system as containing
an error, this oversight (which was introduced in 2.6.36) is quite
significant, and should be backported to older stable kernels with
high urgency.

Reported-by: Ken Sumrall <ksumrall@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ksumrall@google.com
Cc: stable@kernel.org
2012-05-30 23:00:16 -04:00
Theodore Ts'o 2c0544b235 ext4: add debugging trigger for ext4_error()
Make it easy to test whether or not the error handling subsystem in
ext4 is working correctly.  This allows us to simulate an ext4_error()
by echoing a string to /sys/fs/ext4/<dev>/trigger_fs_error.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ksumrall@google.com
2012-05-30 22:56:46 -04:00
Theodore Ts'o 2cde417de0 ext4: return ENOMEM when mounts fail due to lack of memory
This is a port of the ext3 commit: 4569cd1b0d

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 17:49:54 -04:00
Theodore Ts'o 2716b80284 ext4: remove redundundant "(char *) bh->b_data" casts
The b_data field of the buffer_head is already a char *, so there's no
point casting it to a char *.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 17:47:52 -04:00
Akira Fujita 9d99012ff2 ext4: remove needs_recovery in ext4_mb_init()
needs_recovery in ext4_mb_init() is not used, remove it.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 14:19:25 -04:00
Eric Sandeen 7e84b62164 ext4: force ro mount if ext4_setup_super() fails
If ext4_setup_super() fails i.e. due to a too-high revision,
the error is logged in dmesg but the fs is not mounted RO as
indicated.

Tested by:

# mkfs.ext4 -r 4 /dev/sdb6
# mount /dev/sdb6 /mnt/test
# dmesg | grep "too high"
[164919.759248] EXT4-fs (sdb6): revision level too high, forcing read-only mode
# grep sdb6 /proc/mounts
/dev/sdb6 /mnt/test2 ext4 rw,seclabel,relatime,data=ordered 0 0

Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-05-28 14:17:25 -04:00
Linus Torvalds 90324cc1b1 avoid iput() from flusher thread
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
 uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
 +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
 KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
 ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
 tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
 Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
 NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
 /9c4zxxekjHZnn2oooEa
 =oLXf
 -----END PGP SIGNATURE-----

Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
 "Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Avoid iput() from flusher thread
  vfs: Rename end_writeback() to clear_inode()
  vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
  writeback: Refactor writeback_single_inode()
  writeback: Remove wb->list_lock from writeback_single_inode()
  writeback: Separate inode requeueing after writeback
  writeback: Move I_DIRTY_PAGES handling
  writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
  writeback: Move clearing of I_SYNC into inode_sync_complete()
  writeback: initialize global_dirty_limit
  fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
  mm: page-writeback.c: local functions should not be exposed globally
2012-05-28 09:54:45 -07:00
Darrick J. Wong 25ed6e8a54 jbd2: enable journal clients to enable v2 checksumming
Add in the necessary code so that journal clients can enable the new
journal checksumming features.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-27 07:48:56 -04:00
Linus Torvalds ece78b7df7 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, ext3 and quota fixes from Jan Kara:
 "Interesting bits are:
   - removal of a special i_mutex locking subclass (I_MUTEX_QUOTA) since
     quota code does not need i_mutex anymore in any unusual way.
   - backport (from ext4) of a fix of a checkpointing bug (missing cache
     flush) that could lead to fs corruption on power failure

  The rest are just random small fixes & cleanups."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: trivial fix to comment for ext2_free_blocks
  ext2: remove the redundant comment for ext2_export_ops
  ext3: return 32/64-bit dir name hash according to usage type
  quota: Get rid of nested I_MUTEX_QUOTA locking subclass
  quota: Use precomputed value of sb_dqopt in dquot_quota_sync
  ext2: Remove i_mutex use from ext2_quota_write()
  reiserfs: Remove i_mutex use from reiserfs_quota_write()
  ext4: Remove i_mutex use from ext4_quota_write()
  ext3: Remove i_mutex use from ext3_quota_write()
  quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUG
  jbd: Write journal superblock with WRITE_FUA after checkpointing
  jbd: protect all log tail updates with j_checkpoint_mutex
  jbd: Split updating of journal superblock and marking journal empty
  ext2: do not register write_super within VFS
  ext2: Remove s_dirt handling
  ext2: write superblock only once on unmount
  ext3: update documentation with barrier=1 default
  ext3: remove max_debt in find_group_orlov()
  jbd: Refine commit writeout logic
2012-05-25 08:14:59 -07:00
Linus Torvalds 644473e9c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace enhancements from Eric Biederman:
 "This is a course correction for the user namespace, so that we can
  reach an inexpensive, maintainable, and reasonably complete
  implementation.

  Highlights:
   - Config guards make it impossible to enable the user namespace and
     code that has not been converted to be user namespace safe.

   - Use of the new kuid_t type ensures the if you somehow get past the
     config guards the kernel will encounter type errors if you enable
     user namespaces and attempt to compile in code whose permission
     checks have not been updated to be user namespace safe.

   - All uids from child user namespaces are mapped into the initial
     user namespace before they are processed.  Removing the need to add
     an additional check to see if the user namespace of the compared
     uids remains the same.

   - With the user namespaces compiled out the performance is as good or
     better than it is today.

   - For most operations absolutely nothing changes performance or
     operationally with the user namespace enabled.

   - The worst case performance I could come up with was timing 1
     billion cache cold stat operations with the user namespace code
     enabled.  This went from 156s to 164s on my laptop (or 156ns to
     164ns per stat operation).

   - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
     Most uid/gid setting system calls treat these value specially
     anyway so attempting to use -1 as a uid would likely cause
     entertaining failures in userspace.

   - If setuid is called with a uid that can not be mapped setuid fails.
     I have looked at sendmail, login, ssh and every other program I
     could think of that would call setuid and they all check for and
     handle the case where setuid fails.

   - If stat or a similar system call is called from a context in which
     we can not map a uid we lie and return overflowuid.  The LFS
     experience suggests not lying and returning an error code might be
     better, but the historical precedent with uids is different and I
     can not think of anything that would break by lying about a uid we
     can't map.

   - Capabilities are localized to the current user namespace making it
     safe to give the initial user in a user namespace all capabilities.

  My git tree covers all of the modifications needed to convert the core
  kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
  userns:  Silence silly gcc warning.
  cred: use correct cred accessor with regards to rcu read lock
  userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
  userns: Convert cgroup permission checks to use uid_eq
  userns: Convert tmpfs to use kuid and kgid where appropriate
  userns: Convert sysfs to use kgid/kuid where appropriate
  userns: Convert sysctl permission checks to use kuid and kgids.
  userns: Convert proc to use kuid/kgid where appropriate
  userns: Convert ext4 to user kuid/kgid where appropriate
  userns: Convert ext3 to use kuid/kgid where appropriate
  userns: Convert ext2 to use kuid/kgid where appropriate.
  userns: Convert devpts to use kuid/kgid where appropriate
  userns: Convert binary formats to use kuid/kgid where appropriate
  userns: Add negative depends on entries to avoid building code that is userns unsafe
  userns: signal remove unnecessary map_cred_ns
  userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
  userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
  userns: Convert stat to return values mapped from kuids and kgids
  userns: Convert user specfied uids and gids in chown into kuids and kgid
  userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
  ...
2012-05-23 17:42:39 -07:00
Theodore Ts'o f32aaf2d2b ext4: enable the 64-bit jbd2 feature based on the 64-bit ext4 feature
Previously we were only enabling the 64-bit jbd2 feature if the number
of blocks in the file system was greater 2**32-1.  The problem with
this is that it makes it harder to test the 64-bit journal code paths
with small file systems, since a small test file system would with the
64-bit ext4 feature enable would use a 64-bit file system on-disk data
structures, but use a 32-bit journal.

This would also cause problems when trying to do an online resize to
grow the filesystem above the 2**32-1 boundary.  Fortunately the patch
to support online resize for 64-bit file systems hasn't been merged
yet, so this problem hasn't arisen in practice.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-21 11:42:02 -04:00
Eric W. Biederman 08cefc7ab8 userns: Convert ext4 to user kuid/kgid where appropriate
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:27 -07:00
Jan Kara 0b7f7cefae ext4: Remove i_mutex use from ext4_quota_write()
We don't need i_mutex in ext4_quota_write() because writes to quota file
are serialized by dqio_mutex anyway. Changes to quota files outside of quota
code are forbidded and enforced by NOATIME and IMMUTABLE bits.

Signed-off-by: Jan Kara <jack@suse.cz>
2012-05-15 23:34:38 +02:00
Jan Kara dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Darrick J. Wong feb0ab32a5 ext4: make block group checksums use metadata_csum algorithm
metadata_csum supersedes uninit_bg.  Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set.  Print a warning if
we try to mount a filesystem with both feature flags set.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:45:10 -04:00
Darrick J. Wong a9c4731780 ext4: calculate and verify superblock checksum
Calculate and verify the superblock checksum.  Since the UUID and
block group number are embedded in each copy of the superblock, we
need only checksum the entire block.  Refactor some of the code to
eliminate open-coding of the checksum update call.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:29:10 -04:00
Darrick J. Wong 0441984a33 ext4: load the crc32c driver if necessary
Obtain a reference to the cryptoapi and crc32c if we mount a
filesystem with metadata checksumming enabled.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:27:10 -04:00
Darrick J. Wong d25425f8e0 ext4: record the checksum algorithm in use in the superblock
Record the type of checksum algorithm we're using for metadata in the
superblock, in case we ever want/need to change the algorithm.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:25:10 -04:00
Eldad Zack db7e5c668e super.c: unused variable warning without CONFIG_QUOTA
sb info is only checked with quota support.

fs/ext4/super.c: In function ‘parse_options’:
fs/ext4/super.c:1600:23: warning: unused variable ‘sbi’ [-Wunused-variable]

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2012-04-23 21:44:41 -04:00
Theodore Ts'o 57f73c2c89 ext4: fix handling of journalled quota options
Commit 26092bf5 broke handling of journalled quota mount options by
trying to parse argument of every mount option as a number.  Fix this
by dealing with the quota options before we call match_int().

Thanks to Jan Kara for discovering this regression.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-04-16 18:55:26 -04:00
Theodore Ts'o 9cd70b347e ext4: address scalability issue by removing extent cache statistics
Andi Kleen and Tim Chen have reported that under certain circumstances
the extent cache statistics are causing scalability problems due to
cache line bounces.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-04-16 12:16:20 -04:00
Linus Torvalds 69e1aaddd6 Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes
The changes to export dirty_writeback_interval are from Artem's s_dirt
 cleanup patch series.  The same is true of the change to remove the
 s_dirt helper functions which never got used by anyone in-tree.  I've
 run these changes by Al Viro, and am carrying them so that Artem can
 more easily fix up the rest of the file systems during the next merge
 window.  (Originally we had hopped to remove the use of s_dirt from
 ext4 during this merge window, but his patches had some bugs, so I
 ultimately ended dropping them from the ext4 tree.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABCAAGBQJPb39rAAoJENNvdpvBGATwVz8P/3V1NqSsk20VJOLbmEE45GxL
 GDzQJ6OsFG0UiQk6ISSrSdwxfav/KTCGySsU9UtAoOdPcBwnnsf8S7wc6OggwwuC
 hBFGwwFzk6YSQaZ58sUxWRGeOJuP/FPem6Id6buC4DQ1KIcznP/hEEgEnh/ir4Ec
 vrsfexY93TR8BE2Mi23v2epDVLU0B6bY/w9nDqbTXif3xN/gh/ypoHHouuM6Bs2n
 TyWHOwD15NwfnvRHd8PfDDqQM/D29x3QI0FMrWj9McpwIz4d4cBfhN4LQ/G+yLDY
 izv5DM10GbinwHPrsOTGVAW3KIdSS9rP3jCJGVuOrJZ9ufGXosvHuIYVhI7J3SBK
 JhBu6QEsN1IsvlVYpz9q8mqVKaDXQLsz2eaTw+i4yfmyOk1kOX7nIEOxYFF78G+V
 Of/W1SpIpJQaXvLHRcDj9fDj0fZTciUZA8v7/HOFS+co2dzIl0iZbcfBFp0/56RY
 sWdQoeRlx1ciVDPR+w2TQO5w3VWQw1gT5aqux0NiPj0XFoiUHScxgNGAYbqENMQw
 v9chvyDMlorqj0rF/Vey5SssgEDi7MTdYuYTi4YyMqr7pcvOJaO85pf+wH9g2eKW
 XhW33PhPGuwCJDP5Pg8Y0Z2Hp/Q3DCqhLqhGfTyAs/NG9+hR4wgp3VWb8CUqhA1t
 C/yzNeOYqScAefCzQx2V
 =+9zk
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates for 3.4 from Ted Ts'o:
 "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

  The changes to export dirty_writeback_interval are from Artem's s_dirt
  cleanup patch series.  The same is true of the change to remove the
  s_dirt helper functions which never got used by anyone in-tree.  I've
  run these changes by Al Viro, and am carrying them so that Artem can
  more easily fix up the rest of the file systems during the next merge
  window.  (Originally we had hopped to remove the use of s_dirt from
  ext4 during this merge window, but his patches had some bugs, so I
  ultimately ended dropping them from the ext4 tree.)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
  vfs: remove unused superblock helpers
  mm: export dirty_writeback_interval
  ext4: remove useless s_dirt assignment
  ext4: write superblock only once on unmount
  ext4: do not mark superblock as dirty unnecessarily
  ext4: correct ext4_punch_hole return codes
  ext4: remove restrictive checks for EOFBLOCKS_FL
  ext4: always set then trimmed blocks count into len
  ext4: fix trimmed block count accunting
  ext4: fix start and len arguments handling in ext4_trim_fs()
  ext4: update s_free_{inodes,blocks}_count during online resize
  ext4: change some printk() calls to use ext4_msg() instead
  ext4: avoid output message interleaving in ext4_error_<foo>()
  ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
  ext4: add no_printk argument validation, fix fallout
  ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
  ext4: give more helpful error message in ext4_ext_rm_leaf()
  ext4: remove unused code from ext4_ext_map_blocks()
  ext4: rewrite punch hole to use ext4_ext_remove_space()
  jbd2: cleanup journal tail after transaction commit
  ...
2012-03-28 10:02:55 -07:00
Artem Bityutskiy 182f514f88 ext4: remove useless s_dirt assignment
Clean-up ext4 a tiny bit by removing useless s_dirt assignment in
'ext4_fill_super()' because a bit later we anyway call
'ext4_setup_super()' which writes the superblock to the media
unconditionally.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 22:30:06 -04:00
Artem Bityutskiy a8e25a8324 ext4: write superblock only once on unmount
In some rather rare cases it is possible that ext4 may the superblock
to the media twice. This patch makes sure this does not happen. This
should speed up unmounting in those cases.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 22:29:15 -04:00
Al Viro 07c0c5d8b8 ext4: initialization of ext4_li_mtx needs to be done earlier
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 22:05:02 -04:00
Al Viro 48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
Theodore Ts'o 92b9781658 ext4: change some printk() calls to use ext4_msg() instead
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:41:49 -04:00
Joe Perches d9ee81da93 ext4: avoid output message interleaving in ext4_error_<foo>()
Using KERN_CONT means that messages from multiple threads may be
interleaved.  Avoid this by using a single printk call in
ext4_error_inode and ext4_error_file.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:15:43 -04:00
Theodore Ts'o f70486055e ext4: try to deprecate noacl and noxattr_user mount options
No other file system allows ACL's and extended attributes to be
enabled or disabled via a mount option.  So let's try to deprecate
these options from ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 22:06:20 -05:00
Theodore Ts'o c7198b9c1e ext4: ignore mount options supported by ext2/3 (but have since been removed)
Users who tried to use the ext4 file system driver is being used for
the ext2 or ext3 file systems (via the CONFIG_EXT4_USE_FOR_EXT23
option) could have failed mounts if their /etc/fstab contains options
recognized by ext2 or ext3 but which have since been removed in ext4.

So teach ext4 to recognize them and give a warning that the mount
option was removed.

Report: https://bbs.archlinux.org/profile.php?id=33804

Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Baechler <thomas@archlinux.org>
Cc: Tobias Powalowski <tobias.powalowski@googlemail.com>
Cc: Dave Reisner <d@falconindy.com>
2012-03-04 22:00:53 -05:00
Theodore Ts'o 66acdcf4ea ext4: add debugging /proc file showing file system options
Now that /proc/mounts is consistently showing only those mount options
which need to be specified in /etc/fstab or on the mount command line,
it is useful to have file which shows exactly which file system
options are enabled.  This can be useful when debugging a user
problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 20:21:38 -05:00
Theodore Ts'o 5a916be1b3 ext4: make ext4_show_options() be table-driven
Consistently show mount options which are the non-default, so that
/proc/mounts accurately shows the mount options that would be
necessary to mount the file system in its current mode of operation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 19:27:31 -05:00
Theodore Ts'o 2adf6da837 ext4: move ext4_show_options() after parse_options()
This commit is strictly a code movement so in preparation of changing
ext4_show_options to be table driven.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:50 -05:00
Theodore Ts'o 26092bf524 ext4: use a table-driven handler for mount options
By using a table-drive approach, we shave about 100 lines of code from
ext4, and make the code a bit more regular and factored out.  This
will also make it possible in a future patch to use this table for
displaying the mount options that were specified in /proc/mounts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:47 -05:00
Theodore Ts'o 72578c33c4 ext4: unify handling of mount options which have been removed
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 18:04:40 -05:00
Theodore Ts'o 39ef17f1b0 ext4: simplify handling of the errors=* mount options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 17:56:23 -05:00
Theodore Ts'o c64db50e76 ext4: remove the I_VERSION mount flag and use the super_block flag instead
There's no point to have two bits that are set in parallel; so use the
MS_I_VERSION flag that is needed by the VFS anyway, and that way we
free up a bit in sbi->s_mount_opts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:23:11 -05:00
Theodore Ts'o ee4a3fcd1d ext4: remove Opt_ignore
This is completely unused so let's just get rid of it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:14:24 -05:00
Theodore Ts'o 87f26807e9 ext4: remove deprecation warnings for minix_df and grpid
People complained about removing both of these features, so per
Linus's dictate, we won't be able to remove them.  Sigh...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 00:03:21 -05:00
Eric Sandeen 661aa52057 ext4: remove the resize mount option
The resize mount option seems to be of limited value,
especially in the age of online resize2fs.  Nuke it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Eric Sandeen 43e625d84f ext4: remove the journal=update mount option
The V2 journal format was introduced around ten years ago,
for ext3. It seems highly unlikely that anyone will need this
migration option for ext4.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Bobi Jam 18aadd47f8 ext4: expand commit callback and
The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released.  This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:02 -05:00
Theodore Ts'o ff9cb1c4ee Merge branch 'for_linus' into for_linus_merged
Conflicts:
	fs/ext4/ioctl.c
2012-01-10 11:54:07 -05:00
Xi Wang d50f2ab6f0 ext4: fix undefined behavior in ext4_fill_flex_info()
Commit 503358ae01 ("ext4: avoid divide by
zero when trying to mount a corrupted file system") fixes CVE-2009-4307
by performing a sanity check on s_log_groups_per_flex, since it can be
set to a bogus value by an attacker.

	sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
	groups_per_flex = 1 << sbi->s_log_groups_per_flex;

	if (groups_per_flex < 2) { ... }

This patch fixes two potential issues in the previous commit.

1) The sanity check might only work on architectures like PowerPC.
On x86, 5 bits are used for the shifting amount.  That means, given a
large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
is essentially 1 << 4 = 16, rather than 0.  This will bypass the check,
leaving s_log_groups_per_flex and groups_per_flex inconsistent.

2) The sanity check relies on undefined behavior, i.e., oversized shift.
A standard-confirming C compiler could rewrite the check in unexpected
ways.  Consider the following equivalent form, assuming groups_per_flex
is unsigned for simplicity.

	groups_per_flex = 1 << sbi->s_log_groups_per_flex;
	if (groups_per_flex == 0 || groups_per_flex == 1) {

We compile the code snippet using Clang 3.0 and GCC 4.6.  Clang will
completely optimize away the check groups_per_flex == 0, leaving the
patched code as vulnerable as the original.  GCC keeps the check, but
there is no guarantee that future versions will do the same.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-01-10 11:51:10 -05:00
Al Viro 94bf608a18 ext4: fix failure exits
a) leaking root dentry is bad
b) in case of failed ext4_mb_init() we don't want to do ext4_mb_release()
c) OTOH, in the same case we *do* want ext4_ext_release()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09 15:57:20 -05:00
Linus Torvalds eb59c505f8 Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
  PM / Hibernate: Implement compat_ioctl for /dev/snapshot
  PM / Freezer: fix return value of freezable_schedule_timeout_killable()
  PM / shmobile: Allow the A4R domain to be turned off at run time
  PM / input / touchscreen: Make st1232 use device PM QoS constraints
  PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
  PM / shmobile: Remove the stay_on flag from SH7372's PM domains
  PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
  PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
  PM: Drop generic_subsys_pm_ops
  PM / Sleep: Remove forward-only callbacks from AMBA bus type
  PM / Sleep: Remove forward-only callbacks from platform bus type
  PM: Run the driver callback directly if the subsystem one is not there
  PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
  PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
  PM / Sleep: Merge internal functions in generic_ops.c
  PM / Sleep: Simplify generic system suspend callbacks
  PM / Hibernate: Remove deprecated hibernation snapshot ioctls
  PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
  ARM: S3C64XX: Implement basic power domain support
  PM / shmobile: Use common always on power domain governor
  ...

Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
2012-01-08 13:10:57 -08:00
Al Viro 34c80b1d93 vfs: switch ->show_options() to struct dentry *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:19:54 -05:00
Al Viro d8c9584ea2 vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:16:53 -05:00
Ben Hutchings 1d526fc91b ext4: Report max_batch_time option correctly
Currently the value reported for max_batch_time is really the
value of min_batch_time.

Reported-by: Russell Coker <russell@coker.com.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-01-04 21:22:51 -05:00
Al Viro 6b520e0565 vfs: fix the stupidity with i_dentry in inode destructors
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Rafael J. Wysocki b00f4dc5ff Merge branch 'master' into pm-sleep
* master: (848 commits)
  SELinux: Fix RCU deref check warning in sel_netport_insert()
  binary_sysctl(): fix memory leak
  mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
  ipmi_watchdog: restore settings when BMC reset
  oom: fix integer overflow of points in oom_badness
  memcg: keep root group unchanged if creation fails
  nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
  nilfs2: unbreak compat ioctl
  cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  evm: prevent racing during tfm allocation
  evm: key must be set once during initialization
  mmc: vub300: fix type of firmware_rom_wait_states module parameter
  Revert "mmc: enable runtime PM by default"
  mmc: sdhci: remove "state" argument from sdhci_suspend_host
  x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
  IB/qib: Correct sense on freectxts increment and decrement
  RDMA/cma: Verify private data length
  cgroups: fix a css_set not found bug in cgroup_attach_proc
  oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
  Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
  ...

Conflicts:
	kernel/cgroup_freezer.c
2011-12-21 21:59:45 +01:00
Zheng Liu 5635a62b83 ext4: add missing space to ext4_msg output in ext4_fill_super()
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-18 16:13:58 -05:00
Theodore Ts'o fc6cb1cda5 ext4: display the correct mount option in /proc/mounts for [no]init_itable
/proc/mounts was showing the mount option [no]init_inode_table when
the correct mount option that will be accepted by parse_options() is
[no]init_itable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-12-12 22:06:18 -05:00
Rafael J. Wysocki 986b11c3ee Merge branch 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into pm-freezer
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
  freezer: fix wait_event_freezable/__thaw_task races
  freezer: kill unused set_freezable_with_signal()
  dmatest: don't use set_freezable_with_signal()
  usb_storage: don't use set_freezable_with_signal()
  freezer: remove unused @sig_only from freeze_task()
  freezer: use lock_task_sighand() in fake_signal_wake_up()
  freezer: restructure __refrigerator()
  freezer: fix set_freezable[_with_signal]() race
  freezer: remove should_send_signal() and update frozen()
  freezer: remove now unused TIF_FREEZE
  freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
  cgroup_freezer: prepare for removal of TIF_FREEZE
  freezer: clean up freeze_processes() failure path
  freezer: kill PF_FREEZING
  freezer: test freezable conditions while holding freezer_lock
  freezer: make freezing indicate freeze condition in effect
  freezer: use dedicated lock instead of task_lock() + memory barrier
  freezer: don't distinguish nosig tasks on thaw
  freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
  freezer: rename thaw_process() to __thaw_task() and simplify the implementation
  ...
2011-11-23 21:09:02 +01:00
Tejun Heo a0acae0e88 freezer: unexport refrigerator() and update try_to_freeze() slightly
There is no reason to export two functions for entering the
refrigerator.  Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.

* Rename refrigerator() to __refrigerator() and make it return bool
  indicating whether it scheduled out for freezing.

* Update try_to_freeze() to return bool and relay the return value of
  __refrigerator() if freezing().

* Convert all refrigerator() users to try_to_freeze().

* Update documentation accordingly.

* While at it, add might_sleep() to try_to_freeze().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
2011-11-21 12:32:22 -08:00
Richard Weinberger 2397256d62 ext4: Remove kernel_lock annotations
The BKL is gone, these annotations are useless.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:50:09 -05:00
Theodore Ts'o eb513689c9 ext4: ignore journalled data options on remount if fs has no journal
This avoids a confusing failure in the init scripts when the
/etc/fstab has data=writeback or data=journal but the file system does
not have a journal.  So check for this case explicitly, and warn the
user that we are ignoring the (pointless, since they have no journal)
data=* mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:47:42 -05:00
Eric Gouriou 6f91bc5fda ext4: optimize ext4_ext_convert_to_initialized()
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.

In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.

Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 11:43:23 -04:00
Fabrice Jouhaud d44651d0f9 ext4: fix ext4 so it works without CONFIG_PROC_FS
This fixes a bug which was introduced in dd68314ccf.  The problem
came from the test of the return value of proc_mkdir which is always
false without procfs, and this would initialization of ext4.

Signed-off-by: Fabrice Jouhaud <yargil@free.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 16:26:03 -04:00
Lukas Czerner 4113c4caa4 ext4: remove deprecated oldalloc
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.

This is a part of the effort to reduce number of ext4 options hence the
test matrix.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 14:34:47 -04:00
Tao Ma dcf2d804ed ext4: Free resources in some error path in ext4_fill_super
Some of the error path in ext4_fill_super don't release the
resouces properly. So this patch just try to release them
in the right way.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06 12:10:11 -04:00
Theodore Ts'o 5dee54372c ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:10:51 -04:00
Theodore Ts'o 021b65bb1e ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions.  So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:08:51 -04:00
Aditya Kali 7b415bf60f ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:04:51 -04:00
Theodore Ts'o f975d6bcc7 ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:00:51 -04:00
Theodore Ts'o 24aaa8ef4e ext4: convert the free_blocks field in s_flex_groups to be free_clusters
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:58:51 -04:00
Theodore Ts'o 5704265188 ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:56:51 -04:00
Theodore Ts'o bab08ab964 ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
At least initially if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl.  This simplifies the initial implementation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:36:51 -04:00
Theodore Ts'o 281b599597 ext4: read-only support for bigalloc file systems
This adds supports for bigalloc file systems.  It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:34:51 -04:00
Theodore Ts'o 7c2e70879f ext4: add ext4-specific kludge to avoid an oops after the disk disappears
The del_gendisk() function uninitializes the disk-specific data
structures, including the bdi structure, without telling anyone
else.  Once this happens, any attempt to call mark_buffer_dirty()
(for example, by ext4_commit_super), will cause a kernel OOPS.

Fix this for now until we can fix things in an architecturally correct
way.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:28:51 -04:00
Theodore Ts'o 56889787cf ext4: improve handling of conflicting mount options
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).

Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported.  Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.

Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03 18:22:38 -04:00
Jiaying Zhang 2581fdc810 ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.

Also moves ext4_flush_completed_IO and ext4_ioend_wait from
ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion,
ext4_evict_inode() is called before ext4_destroy_inode() and in
ext4_evict_inode(), we may call ext4_truncate() without holding
i_mutex lock. As a result, there is a race between flush_completed_IO
that is called from ext4_ext_truncate() and ext4_end_io_work, which
may cause corruption on an io_end structure. This change moves
ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
to ext4_evict_inode() to resolve the race between ext4_truncate() and
ext4_end_io_work during inode deletion.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-08-13 12:17:13 -04:00
Mathias Krause db9481c047 ext4: use kzalloc in ext4_kzalloc()
Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and
ext4_kvfree()) intruduced wrappers around k*alloc/vmalloc but introduced
a typo for ext4_kzalloc() by not using kzalloc() but kmalloc().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-03 14:57:11 -04:00
Theodore Ts'o f18a5f21c2 ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:38 -04:00
Theodore Ts'o 9933fc0ac1 ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:02 -04:00
Yongqiang Yang 8f82f840ec ext4: prevent parallel resizers by atomic bit ops
Before this patch, parallel resizers are allowed and protected by a
mutex lock, actually, there is no need to support parallel resizer, so
this patch prevents parallel resizers by atmoic bit ops, like
lock_page() and unlock_page() do.

To do this, the patch removed the mutex lock s_resize_lock from struct
ext4_sb_info and added a unsigned long field named s_resize_flags
which inidicates if there is a resizer.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:35:44 -04:00
Dan Ehrenberg 3eb0865843 ext4: ignore a stripe width of 1
If the stripe width was set to 1, then this patch will ignore
that stripe width and ext4 will act as if the stripe width
were 0 with respect to optimizing allocations.

Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-17 21:18:51 -04:00
Theodore Ts'o 12706394bc ext4: add tracepoint for ext4_journal_start
This will help debug who is responsible for starting a jbd2 transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-10 22:37:50 -04:00
Lukas Czerner f17722f917 ext4: Fix max file size and logical block counting of extent format file
Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
format and fill the tail of file up to its end. We will hit the BUG_ON
when we write the last block (2^32-1) into the sparse file.

The root cause of the problem lies in the fact that we specifically set
s_maxbytes so that block at s_maxbytes fit into on-disk extent format,
which is 32 bit long. However, we are not storing start and end block
number, but rather start block number and length in blocks. It means
that in order to cover extent from 0 to EXT_MAX_BLOCK we need
EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) -
and it does not.

The only way to fix it without changing the meaning of the struct
ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
by one fs block so we can cover the whole extent we can get by the
on-disk extent format.

Also in many places EXT_MAX_BLOCK is used as length instead of maximum
logical block number as the name suggests, it is all a bit messy. So
this commit renames it to EXT_MAX_BLOCKS and change its usage in some
places to actually be maximum number of blocks in the extent.

The bug which this commit fixes can be reproduced as follows:

 dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-2))
 sync
 dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-1))

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-06 00:05:17 -04:00
Linus Torvalds f8d613e2a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
  xen: cleancache shim to Xen Transcendent Memory
  ocfs2: add cleancache support
  ext4: add cleancache support
  btrfs: add cleancache support
  ext3: add cleancache support
  mm/fs: add hooks to support cleancache
  mm: cleancache core ops functions and config
  fs: add field to superblock to support cleancache
  mm/fs: cleancache documentation

Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
2011-05-26 10:50:56 -07:00
Dan Magenheimer 7abc52c2ed ext4: add cleancache support
This seventh patch of eight in this cleancache series "opts-in"
cleancache for ext4.  Filesystems must explicitly enable cleancache
by calling cleancache_init_fs anytime an instance of the filesystem
is mounted. For ext4, all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:02:03 -06:00
Johann Lombardi c5e06d101a ext4: add support for multiple mount protection
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated is displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:31:25 -04:00
Kazuya Mio d02a9391f7 ext4: ensure f_bfree returned by ext4_statfs() is non-negative
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs will go negative when the filesystem has
few free blocks. Because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.

CASE 1:
ext4_da_writepages
  mpage_da_map_and_submit
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
        ext4_da_update_reserve_space
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                            used + ei->i_allocated_meta_blocks);

CASE 2:
ext4_write_begin
  __block_write_begin
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
2011-05-24 18:30:07 -04:00
Vivek Haldar 77f4135f2a ext4: count hits/misses of extent cache and expose in sysfs
The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.

Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:24:16 -04:00
Theodore Ts'o 373cd5c53d ext4: don't show mount options in /proc/mounts if there is no journal
After creating an ext4 file system without a journal:

  # mke2fs -t ext4 -O ^has_journal /dev/sda
  # mount -t ext4 /dev/sda /test

the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.

So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.

Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 16:12:35 -04:00
Lukas Czerner 1bb933fb1f ext4: fix possible use-after-free in ext4_remove_li_request()
We need to take reference to the s_li_request after we take a mutex,
because it might be freed since then, hence result in accessing old
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:29 -04:00
Lukas Czerner 51ce651156 ext4: fix the mount option "init_itable=n" to work as expected for n=0
For some reason, when we set the mount option "init_itable=0" it
behaves as we would set init_itable=20 which is not right at all.
Basically when we set it to zero we are saying to lazyinit thread not
to wait between zeroing the inode table (except of cond_resched()) so
this commit fixes that and removes the unnecessary condition.  The 'n'
should be also properly used on remount.

When the n is not set at all, it means that the default miltiplier
EXT4_DEF_LI_WAIT_MULT is set instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:16 -04:00
Lukas Czerner e1290b3e62 ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
For some reason we have been waiting for lazyinit thread to start in the
ext4_run_lazyinit_thread() but it is not needed since it was jus
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since it is not used anymore.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:51 -04:00
Lukas Czerner 4ed5c033c1 ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread
In order to make lazyinit eat approx. 10% of io bandwidth at max, we
are sleeping between zeroing each single inode table. For that purpose
we are using timer which wakes up thread when it expires. It is set
via add_timer() and this may cause troubles in the case that thread
has been woken up earlier and in next iteration we call add_timer() on
still running timer hence hitting BUG_ON in add_timer(). We could fix
that by using mod_timer() instead however we can use
schedule_timeout_interruptible() for waiting and hence simplifying
things a lot.

This commit exchange the old "waiting mechanism" with simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
do not longer need li_wait_daemon waiting queue and others, so get rid
of it.

Addresses-Red-Hat-Bugzilla: #699708

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:04 -04:00
Tao Ma ed3ce80a52 ext4: don't warn about mnt_count if it has been disabled
Currently, if we mkfs a new ext4 volume with s_max_mnt_count set to
zero, and mount it for the first time, we will get the warning:

	maximal mount count reached, running e2fsck is recommended

It is really misleading. So change the check so that it won't warn in
that case.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-18 13:29:57 -04:00
Amir Goldstein 0b26859027 ext4: fix oops in ext4_quota_off()
If quota is not enabled when ext4_quota_off() is called, we must not
dereference quota file inode since it is NULL.  Check properly for
this.

This fixes a bug in commit 21f976975c (ext4: remove unnecessary
[cm]time update of quota file), which was merged for 2.6.39-rc3.

Reported-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-16 09:59:13 -04:00
Amerigo Wang 66bb82798d ext4: remove redundant #ifdef in super.c
There is already an #ifdef CONFIG_QUOTA some lines above,
so this one is totally useless.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:30:41 -04:00
Tao Ma 55ff3840a2 ext4: remove redundant check for first_not_zeroed in ext4_register_li_request
We have checked first_not_zeroed == ngroups already above, so remove
this redundant check.

sbi->s_li_request = NULL above is also removed since it is NULL
already.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:28:41 -04:00
Theodore Ts'o 2035e77605 ext4: check for ext[23] file system features when mounting as ext[23]
Provide better emulation for ext[23] mode by enforcing that the file
system does not have any unsupported file system features as defined
by ext[23] when emulating the ext[23] file system driver when
CONFIG_EXT4_USE_FOR_EXT23 is defined.

This causes the file system type information in /proc/mounts to be
correct for the automatically mounted root file system.  This also
means that "mount -t ext2 /dev/sda /mnt" will fail if /dev/sda
contains an ext3 or ext4 file system, just as one would expect if the
original ext2 file system driver were in use.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-18 17:29:14 -04:00
Linus Torvalds a97b52022a Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix data corruption regression by reverting commit 6de9843dab
  ext4: Allow indirect-block file to grow the file size to max file size
  ext4: allow an active handle to be started when freezing
  ext4: sync the directory inode in ext4_sync_parent()
  ext4: init timer earlier to avoid a kernel panic in __save_error_info
  jbd2: fix potential memory leak on transaction commit
  ext4: fix a double free in ext4_register_li_request
  ext4: fix credits computing for indirect mapped files
  ext4: remove unnecessary [cm]time update of quota file
  jbd2: move bdget out of critical section
2011-04-11 15:45:47 -07:00
Yongqiang Yang be4f27d324 ext4: allow an active handle to be started when freezing
ext4_journal_start_sb() should not prevent an active handle from being
started due to s_frozen.  Otherwise, deadlock is easy to happen, below
is a situation.

================================================
     freeze         |       truncate
================================================
                    |  ext4_ext_truncate()
    freeze_super()  |   starts a handle
    sets s_frozen   |
                    |  ext4_ext_truncate()
                    |  holds i_data_sem
  ext4_freeze()     |
  waits for updates |
                    |  ext4_free_blocks()
                    |  calls dquot_free_block()
                    |
                    |  dquot_free_blocks()
                    |  calls ext4_dirty_inode()
                    |
                    |  ext4_dirty_inode()
                    |  trys to start an active
                    |  handle
                    |
                    |  block due to s_frozen
================================================

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Amir Goldstein <amir73il@users.sf.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2011-04-10 22:06:07 -04:00
Tao Ma 0449641130 ext4: init timer earlier to avoid a kernel panic in __save_error_info
During mount, when we fail to open journal inode or root inode, the
__save_error_info will mod_timer. But actually s_err_report isn't
initialized yet and the kernel oops. The detailed information can
be found https://bugzilla.kernel.org/show_bug.cgi?id=32082.

The best way is to check whether the timer s_err_report is initialized
or not. But it seems that in include/linux/timer.h, we can't find a
good function to check the status of this timer, so this patch just
move the initializtion of s_err_report earlier so that we can avoid
the kernel panic. The corresponding del_timer is also added in the
error path.

Reported-by: Sami Liedes <sliedes@cc.hut.fi>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-05 19:55:28 -04:00
Tao Ma 46e4690bbd ext4: fix a double free in ext4_register_li_request
In ext4_register_li_request, we malloc a ext4_li_request and
inserts it into ext4_li_info->li_request_list. In case of any
error later, we free it in the end.  But if we have some error
in ext4_run_lazyinit_thread, the whole li_request_list will be
dropped and freed in it. So we will double free this ext4_li_request.

This patch just sets elr to NULL after it is inserted to the list
so that the latter kfree won't double free it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-04-04 16:00:49 -04:00
Jan Kara 21f976975c ext4: remove unnecessary [cm]time update of quota file
It is not necessary to update [cm]time of quota file on each quota
file write and it wastes journal space and IO throughput with inode
writes. So just remove the updating from ext4_quota_write() and only
update times when quotas are being turned off. Userspace cannot get
anything reliable from quota files while they are used by the kernel
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-04 15:33:39 -04:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds ae005cbed1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits)
  ext4: fix a BUG in mb_mark_used during trim.
  ext4: unused variables cleanup in fs/ext4/extents.c
  ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep()
  ext4: add more tracepoints and use dev_t in the trace buffer
  ext4: don't kfree uninitialized s_group_info members
  ext4: add missing space in printk's in __ext4_grp_locked_error()
  ext4: add FITRIM to compat_ioctl.
  ext4: handle errors in ext4_clear_blocks()
  ext4: unify the ext4_handle_release_buffer() api
  ext4: handle errors in ext4_rename
  jbd2: add COW fields to struct jbd2_journal_handle
  jbd2: add the b_cow_tid field to journal_head struct
  ext4: Initialize fsync transaction ids in ext4_new_inode()
  ext4: Use single thread to perform DIO unwritten convertion
  ext4: optimize ext4_bio_write_page() when no extent conversion is needed
  ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
  ext4: use the nblocks arg to ext4_truncate_restart_trans()
  ext4: fix missing iput of root inode for some mount error paths
  ext4: make FIEMAP and delayed allocation play well together
  ext4: suppress verbose debugging information if malloc-debug is off
  ...

Fi up conflicts in fs/ext4/super.c due to workqueue changes
2011-03-25 09:57:41 -07:00
Robin Dong 21149d611e ext4: add missing space in printk's in __ext4_grp_locked_error()
When we do performence-testing on ext4 filesystem, we observed a
warning like this:

EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd

instead, it should be

"group 2598, 25901 blocks in bitmap, 26057 in gd"

Reviewed-by: Coly Li <bosong.ly@taobao.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 20:39:22 -04:00
Linus Torvalds bd2895eead Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix build failure introduced by s/freezeable/freezable/
  workqueue: add system_freezeable_wq
  rds/ib: use system_wq instead of rds_ib_fmr_wq
  net/9p: replace p9_poll_task with a work
  net/9p: use system_wq instead of p9_mux_wq
  xfs: convert to alloc_workqueue()
  reiserfs: make commit_wq use the default concurrency level
  ocfs2: use system_wq instead of ocfs2_quota_wq
  ext4: convert to alloc_workqueue()
  scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
  scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
  misc/iwmc3200top: use system_wq instead of dedicated workqueues
  i2o: use alloc_workqueue() instead of create_workqueue()
  acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
  fs/aio: aio_wq isn't used in memory reclaim path
  input/tps6507x-ts: use system_wq instead of dedicated workqueue
  cpufreq: use system_wq instead of dedicated workqueues
  wireless/ipw2x00: use system_wq instead of dedicated workqueues
  arm/omap: use system_wq in mailbox
  workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
2011-03-16 08:20:19 -07:00
Aneesh Kumar K.V f2fa2ffc20 ext4: Copy fs UUID to superblock
File system UUID is made available to application
via  /proc/<pid>/mountinfo

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:45 -04:00
Mingming Cao 198868f35d ext4: Use single thread to perform DIO unwritten convertion
While running ext4 testing on multiple core, we found there are per
cpu ext4-dio-unwritten threads processing conversion from unwritten
extents to written for IOs completed from async direct IO patch.  Per
filesystem is enough, we don't need per cpu threads to work on
conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2011-03-05 11:52:45 -05:00
Amir Goldstein d39195c33b ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.

This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext4 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-28 00:53:45 -05:00
Manish Katiyar 32a9bb57d7 ext4: fix missing iput of root inode for some mount error paths
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including call sync_filesystem, which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.

https://bugzilla.kernel.org/show_bug.cgi?id=26752

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 20:42:06 -05:00
Theodore Ts'o 6fd7a46781 ext4: enable mblk_io_submit by default
Now that we've fixed the file corruption bug in commit d50bdd5aa5,
it's time to enable mblk_io_submit by default.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 13:53:09 -05:00
Eric Sandeen ea66333694 ext4: enable acls and user_xattr by default
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.

Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel.  At some point
we can start to deprecate the options, perhaps.

For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:51:51 -05:00
Lukas Czerner 0b75a84012 ext4: mark file-local functions and variables as static
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:22:49 -05:00
Alexander V. Lukyanov 5dbd571d87 ext4: allow inode_readahead_blks=0 (linux-2.6.37)
I cannot disable inode-read-ahead feature of ext4 (on 2.6.37):

# echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks 
bash: echo: write error: Invalid argument

On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using value of 1, but it still reads an extra block.

This patch fixes the problem by checking for zero explicitly.

Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:33:21 -05:00
Peter Huewe 7dc576158d ext4: Fix sparse warning: Using plain integer as NULL pointer
This patch fixes the warning "Using plain integer as NULL pointer",
generated by sparse, by replacing the offending 0s with NULL.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:01:42 -05:00
Tejun Heo 43d133c18b Merge branch 'master' into for-2.6.39 2011-02-21 09:43:56 +01:00
Eric Sandeen e9e3bcecf4 ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00
Theodore Ts'o dd68314ccf ext4: fix up ext4 error handling
Make sure we the correct cleanup happens if we die while trying to
load the ext4 file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:49 -05:00
Lukas Czerner 8f021222c1 ext4: unregister features interface on module unload
Ext4 features interface was not properly unregistered which led to
problems while unloading/reloading ext4 module. This commit fixes that by
adding proper kobject unregistration code into ext4_exit_fs() as well as
fail-path of ext4_init_fs()

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-02-03 14:33:33 -05:00
Eric Sandeen 8f1f745331 ext4: fix panic on module unload when stopping lazyinit thread
https://bugzilla.kernel.org/show_bug.cgi?id=27652

If the lazyinit thread is running, the teardown function
ext4_destroy_lazyinit_thread() has problems:

        ext4_clear_request_list();
        while (ext4_li_info->li_task) {
                wake_up(&ext4_li_info->li_wait_daemon);
                wait_event(ext4_li_info->li_wait_task,
                           ext4_li_info->li_task == NULL);
        }

Clearing the request list will cause the thread to exit and free
ext4_li_info, so then we're waiting on something which is getting
freed.

Fix this up by making the thread respond to kthread_stop, and exit,
without the need to wait for that exit in some other homegrown way.

Cc: stable@kernel.org
Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:15 -05:00
Tejun Heo fd89d5f203 ext4: convert to alloc_workqueue()
Convert create_workqueue() to alloc_workqueue().  This is an identity
conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
2011-02-01 11:42:42 +01:00
Linus Torvalds 4843456c5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix deadlock during path resolution
2011-01-21 07:33:37 -08:00
Linus Torvalds 275220f0fc Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13 10:45:01 -08:00
Jan Kara f00c9e44ad quota: Fix deadlock during path resolution
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-12 19:14:55 +01:00
Linus Torvalds e9688f6aca Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: fix trimming starting with block 0 with small blocksize
  ext4: revert buggy trim overflow patch
  ext4: don't pass entire map to check_eofblocks_fl
  ext4: fix memory leak in ext4_free_branches
  ext4: remove ext4_mb_return_to_preallocation()
  ext4: flush the i_completed_io_list during ext4_truncate
  ext4: add error checking to calls to ext4_handle_dirty_metadata()
  ext4: fix trimming of a single group
  ext4: fix uninitialized variable in ext4_register_li_request
  ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
  ext4: drop i_state_flags on architectures with 64-bit longs
  ext4: reorder ext4_inode_info structure elements to remove unneeded padding
  ext4: drop ec_type from the ext4_ext_cache structure
  ext4: use ext4_lblk_t instead of sector_t for logical blocks
  ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
  ext4: fix 32bit overflow in ext4_ext_find_goal()
  ext4: add more error checks to ext4_mkdir()
  ext4: ext4_ext_migrate should use NULL not 0
  ext4: Use ext4_error_file() to print the pathname to the corrupted inode
  ext4: use IS_ERR() to check for errors in ext4_error_file
  ...
2011-01-11 14:37:31 -08:00
Andrew Morton 6c5a6cb998 ext4: fix uninitialized variable in ext4_register_li_request
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:17 -05:00
Theodore Ts'o 8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Theodore Ts'o f232109773 ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:36 -05:00
Theodore Ts'o f7c21177af ext4: Use ext4_error_file() to print the pathname to the corrupted inode
Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:55 -05:00
Dan Carpenter f9a62d090c ext4: use IS_ERR() to check for errors in ext4_error_file
d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
ext4_error_file() and no one actually calls ext4_error_file().

Signed-off-by: Dan Carpenter <error27@gmail.com>
2011-01-10 12:10:50 -05:00
Nick Piggin fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Joe Perches 0ff2ea7d84 ext4: Use printf extension %pV
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.

In function __ext4_grp_locked_error also added KERN_CONT to some
printks.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:43:19 -05:00
Joe Perches 94de56ab20 ext4: Use vzalloc in ext4_fill_flex_info()
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:21:02 -05:00
Theodore Ts'o a2595b8aa6 ext4: Add second mount options field since the s_mount_opt is full up
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:30:48 -05:00
Theodore Ts'o 673c610033 ext4: Move struct ext4_mount_options from ext4.h to super.c
Move the ext4_mount_options structure definition from ext4.h, since it
is only used in super.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:28:48 -05:00
Theodore Ts'o fd8c37eccd ext4: Simplify the usage of clear_opt() and set_opt() macros
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt.  This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:26:48 -05:00
Theodore Ts'o 1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Lukas Czerner 93bb41f4f8 fs: Do not dispatch FITRIM through separate super_operation
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.

So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:18:35 -05:00
Darrick J. Wong 5a9ae68a34 ext4: ext4_fill_super shouldn't return 0 on corruption
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret.  However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail.  This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.

A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 09:56:44 -05:00
Dan Carpenter f4c8cc652d ext4: missing unlock in ext4_clear_request_list()
If the the li_request_list was empty then it returned with the lock
held.  Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:25 -05:00
Tejun Heo d4d7762995 block: clean up blkdev_get() wrappers and their users
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:18 +01:00
Tejun Heo e525fd89d3 block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
Theodore Ts'o 7ff9c073dd ext4: Add new ext4 inode tracepoints
Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
ext4_begin_ordered_truncate()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:51:33 -05:00
Dmitry Monakhov 87009d86dc ext4: do not try to grab the s_umount semaphore in ext4_quota_off
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.

Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2010-11-08 13:47:33 -05:00