Commit Graph

818 Commits

Author SHA1 Message Date
Chao Yu 3aab8f828e f2fs: introduce f2fs_write_failed to handle error case when write
When we fail in ->write_begin()/->direct_IO(), the node block we allocated on
disk and the page cache are still kept, even though they may never be used again.

This patch introduces f2fs_write_failed() to handle the error case of these two
interfaces; it truncates the page cache and the blocks of this file according to
i_size.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng eee6160f2e f2fs: arguments cleanup of finding file flow functions
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng 1c3bb97899 f2fs: remove the needless point-cast
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng 34e6d456da f2fs: remove the redundant validation check of acl
On the kernel side (xx_init_acl), the acl is fetched/cloned from the parent
dir's acl, which is trustworthy. So remove the redundant validation check of
the acl here.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Chao Yu 1256010ab1 f2fs: reduce region of f2fs_lock_op covered for better concurrency
In our rename process, the region covered by f2fs_lock_op is too big, as some of
the code, such as f2fs_empty_dir/f2fs_find_entry, does not need this lock.

So an extreme case, such as a checkpoint running while we rename an old inode
onto an existing inode in a large directory, could lower concurrency.

Let's reduce the region covered by f2fs_lock_op to fix this.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Fabian Frederick b434babf85 f2fs: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.
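
As a rough userspace illustration of what an overflow-checked array allocation does, the sketch below rejects a request whose count * size would wrap around. The names checked_calloc() and checked_calloc_demo() are hypothetical stand-ins for this example; this is not the kernel's kcalloc() implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for an overflow-checked, zeroing array
 * allocator. It fails instead of wrapping when count * size overflows.
 */
static void *checked_calloc(size_t count, size_t size)
{
	void *p;

	if (size != 0 && count > SIZE_MAX / size)
		return NULL;	/* count * size would overflow size_t */

	p = malloc(count * size);
	if (p)
		memset(p, 0, count * size);
	return p;
}

/* Usage sketch: a normal request succeeds, a huge one fails cleanly. */
static void checked_calloc_demo(void)
{
	int *v = checked_calloc(4, sizeof(*v));

	assert(v != NULL && v[3] == 0);
	free(v);
	assert(checked_calloc(SIZE_MAX, sizeof(*v)) == NULL);
}
```

With an open-coded count * size passed to a plain allocator, the second request could wrap to a small size and return a buffer that is too short.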

Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Chao Yu aec71382c6 f2fs: refactor flush_nat_entries codes for reducing NAT writes
Although building the NAT journal in cursum reduces the read/write work for NAT
blocks, the previous design gives us lower performance when checkpoints are
written frequently, in these cases:
1. If the journal in cursum is already full, it is a bit of a waste that we
   flush all nat entries to pages for persistence, but do not cache any entries.
2. If the journal in cursum is not full, we fill nat entries into the journal
   until it is full, then flush the remaining dirty entries to disk without
   merging the journaled entries, so these journaled entries may be flushed to
   disk at the next checkpoint, having lost the chance to be flushed last time.

In this patch we merge dirty entries located in the same NAT block into a nat
entry set, and link all sets into a list, sorted in ascending order by each
set's entry count. Later we flush entries in sparse sets into the journal, as
many as we can, and then flush the merged entries to disk. In this way we not
only gain performance, but also extend the lifetime of the flash device.
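
The grouping and ordering described above can be sketched in plain C as follows. This is a simplified model, not the code from the patch: struct nat_set, build_sets(), flush_sets(), MAX_SETS and the value used for NAT_ENTRY_PER_BLOCK are assumptions made for illustration, and a flat array stands in for the linked list.

```c
#include <assert.h>
#include <stdlib.h>

#define NAT_ENTRY_PER_BLOCK 455	/* assumed number of entries per NAT block */
#define MAX_SETS 64		/* arbitrary limit for this sketch */

struct nat_set {
	unsigned int block;	/* NAT block shared by the entries in the set */
	unsigned int count;	/* number of dirty entries in that block */
};

static int cmp_count(const void *a, const void *b)
{
	const struct nat_set *x = a, *y = b;

	return (x->count > y->count) - (x->count < y->count);
}

/* Group dirty nids by NAT block, then sort the sets by ascending count. */
static size_t build_sets(const unsigned int *nids, size_t n,
			 struct nat_set *sets)
{
	size_t nsets = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int blk = nids[i] / NAT_ENTRY_PER_BLOCK;
		size_t s;

		for (s = 0; s < nsets; s++)
			if (sets[s].block == blk)
				break;
		if (s == nsets) {
			assert(nsets < MAX_SETS);
			sets[nsets].block = blk;
			sets[nsets].count = 0;
			nsets++;
		}
		sets[s].count++;
	}
	qsort(sets, nsets, sizeof(*sets), cmp_count);
	return nsets;
}

/*
 * Put the sparsest sets into the journal while they fit; every other set
 * costs one NAT block write. Returns the number of NAT block writes.
 */
static size_t flush_sets(const struct nat_set *sets, size_t nsets,
			 unsigned int journal_free)
{
	size_t block_writes = 0;

	for (size_t i = 0; i < nsets; i++) {
		if (sets[i].count <= journal_free)
			journal_free -= sets[i].count;
		else
			block_writes++;
	}
	return block_writes;
}
```

In this model, sparse sets are absorbed by the journal first, so a block write is only spent on sets that carry many dirty entries.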

In my testing environment, this patch clearly reduces NAT block writes. In the
hard disk test case, the time taken by fsstress is steadily reduced by about
5%.

1. virtual machine + hard disk:
fsstress -p 20 -n 200 -l 5
		node num	cp count	nodes/cp
based		4599.6		1803.0		2.551
patched		2714.6		1829.6		1.483

2. virtual machine + 32g micro SD card:
fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
-f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
-f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S

		node num	cp count	nodes/cp
based		84.5		43.7		1.933
patched		49.2		40.0		1.23

The latency of the merge operation is acceptable even in an extreme case such
as merging a great number of dirty nats:
latency(ns)	dirty nat count
3089219		24922
5129423		27422
4000250		24523

change log from v1:
 o fix wrong logic in add_nat_entry when grabbing a new nat entry set.
 o switch to creating the slab cache in create_node_manager_caches.
 o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.

change log from v2:
 o make comment position more appropriate, as suggested by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Jaegeuk Kim a014e037be f2fs: clean up an unused parameter and assignment
This patch cleans up some simple, unnecessary code.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Jaegeuk Kim b97a9b5da8 f2fs: introduce f2fs_do_tmpfile for code consistency
This patch adds f2fs_do_tmpfile to eliminate the redundant init_inode_metadata
flow.
Through this, we can provide consistent lock usage, e.g., fi->i_sem, and
this will enable better debugging.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu 50732df02e f2fs: support ->tmpfile()
Add function f2fs_tmpfile() to support O_TMPFILE file creation, and modify the
logic of init_inode_metadata to enable linkat on the temp file.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu ca0a81b397 f2fs: avoid to truncate non-updated page partially
After we call find_data_page in truncate_partial_data_page, we cannot
guarantee that this page is up to date, as an error may have occurred in a
lower layer.

We'd better check the status of the page to avoid writing back a page that is
not up to date to the device.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu 5576cd6ca5 f2fs: avoid unneeded SetPageUptodate in f2fs_write_end
We have already set the page uptodate in ->write_begin, so we should remove the
redundant SetPageUptodate in ->write_end.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu 50e1f8d221 f2fs: avoid to access NULL pointer in issue_flush_thread
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=75861

Denis 2014-05-10 11:28:59 UTC reported:
"F2FS-fs (mmcblk0p28): mounting..
 Unable to handle kernel NULL pointer dereference at virtual address 00000018
 ...
 [<c0a2f678>] (_raw_spin_lock+0x3c/0x70) from [<c03a0330>] (issue_flush_thread+0x50/0x17c)
 [<c03a0330>] (issue_flush_thread+0x50/0x17c) from [<c01b4064>] (kthread+0x98/0xa4)
 [<c01b4064>] (kthread+0x98/0xa4) from [<c0108060>] (kernel_thread_exit+0x0/0x8)"

This patch assigns cmd_control_info in sm_info before issue_flush_thread is
created, which makes sure that the issue flush thread has no chance to access
invalid info in fcc.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:55 -07:00
Jaegeuk Kim 2743f86554 f2fs: check bdi->dirty_exceeded when trying to skip data writes
If we don't check the current backing device status, balance_dirty_pages can
fall into an infinite pausing routine.

This can occur when a lot of directories make a small number of dirty
dentry pages including files.

Reported-by: Brian Chadwick <brianchad@westnet.com.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:45 -07:00
Jaegeuk Kim b2c0829912 f2fs: do checkpoint for the renamed inode
If an inode is renamed, it should be registered as file_lost_pino to conduct
checkpoint at f2fs_sync_file.
Otherwise, the inode cannot be recovered due to no dent_mark in the following
scenario.

Note that, this scenario is from xfstests/322.

1. create "a"
2. fsync "a"
3. rename "a" to "b"
4. fsync "b"
5. Sudden power-cut

After recovery is done, "b" should be seen.
However, the result shows "a", since the recovery procedure does not enter
recover_dentry due to no dent_mark.
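
Steps 1 to 4 of the scenario can be driven from userspace with a sketch like the one below. The helper name create_fsync_rename_fsync() is hypothetical, and the power cut in step 5 is not simulated; after a real power cut and recovery, the expectation stated above is that only the new name is visible.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: create "a", fsync it, rename it to "b", fsync again.
 * The descriptor keeps referring to the same inode across the rename, so the
 * second fsync() covers the file now named "b".
 * Returns 0 on success, -1 on error.
 */
static int create_fsync_rename_fsync(const char *a, const char *b)
{
	int fd;

	assert(a != NULL && b != NULL);

	fd = open(a, O_CREAT | O_WRONLY | O_TRUNC, 0644);	/* 1. create "a" */
	if (fd < 0)
		return -1;

	if (fsync(fd) < 0 ||		/* 2. fsync "a" */
	    rename(a, b) < 0 ||		/* 3. rename "a" to "b" */
	    fsync(fd) < 0) {		/* 4. fsync "b" */
		close(fd);
		return -1;
	}
	return close(fd);
}
```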

The reason is like below.
- The nid of "a" is checkpointed during #2, f2fs_sync_file.
- The inode page for "b" produced by #3 is written without dent_mark by
sync_node_pages.

So, this patch fixes this bug by assigning file_lost_pino to the inode of "a".
If the pino is lost, f2fs_sync_file conducts checkpoint, and then recovers
the latest pino and its dentry information for further recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:31 -07:00
Chao Yu dd4d961fe7 f2fs: release new entry page correctly in error path of f2fs_rename
This patch corrects the code that releases new_page, to avoid a BUG_ON in the
error path of f2fs_rename.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:11 -07:00
Chao Yu 90d72459cc f2fs: fix error path in init_inode_metadata
If we fail in this path:
->init_inode_metadata
  ->make_empty_dir
    ->get_new_data_page
      ->grab_cache_page return -ENOMEM

We will hit a BUG_ON in the error path of init_inode_metadata when calling
remove_inode_page, because i_block = 2 (one inode block, which will be released
later, and one dentry block).

We should release the dentry block in init_inode_metadata to avoid this BUG_ON
and to avoid leaking the dentry block, because we never get a second chance to
release that block in ->evict_inode: in the upper error path we make this inode
'bad'.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:58:50 -07:00
Chao Yu d6b7d4b31d f2fs: check lower bound nid value in check_nid_range
This patch adds lower bound verification for nid in check_nid_range, so reserved
nids such as 0, node, and meta passed by the caller can be checked there.

check_nid_range can then be used in f2fs_nfs_get_inode to simplify the
code.
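
A two-sided range check of this kind might look like the sketch below. The constant FIRST_VALID_NID and the flat function signature are assumptions for illustration (here nids 0, 1 and 2 are treated as reserved); the kernel helper takes its bounds from the superblock info instead.

```c
#include <assert.h>
#include <errno.h>

/* Assumed for this sketch: nids 0, 1 (node) and 2 (meta) are reserved. */
#define FIRST_VALID_NID 3u

/* Returns 0 when nid lies in [FIRST_VALID_NID, max_nid), else -EINVAL. */
static int check_nid_range(unsigned int nid, unsigned int max_nid)
{
	assert(max_nid > FIRST_VALID_NID);

	if (nid < FIRST_VALID_NID)
		return -EINVAL;	/* lower bound: reserved nids */
	if (nid >= max_nid)
		return -EINVAL;	/* upper bound: beyond the NAT */
	return 0;
}
```

A caller that receives a nid from outside, for example from a file handle, can then reject it with a single call instead of open-coding both comparisons.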

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:58:08 -07:00
Chao Yu 8bc6f60e3f f2fs: remove unused variables in f2fs_sm_info
Remove unused variables in struct f2fs_sm_info.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:57:57 -07:00
Jaegeuk Kim 98397ff3cd f2fs: fix not to allocate unnecessary blocks during fallocate
This patch fixes the fallocate bug like below. (See xfstests/255)

In fallocate(fd, 0, 20480),
expand_inode_data processes
	for (index = pg_start; index <= pg_end; index++) {
		f2fs_reserve_block();
		...
	}

So, even though fallocate requests 20480 bytes, i.e. 5 blocks, f2fs allocates
6 blocks, including pg_end.
So, this patch adds one condition to avoid the extra block allocation.
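
The off-by-one can be reproduced with simple arithmetic. The sketch below assumes 4KB blocks and uses hypothetical helper names; the extra condition in blocks_needed() is one possible way to skip the last page when nothing in it was requested, not necessarily the exact condition used in the patch.

```c
#include <assert.h>

#define BLOCK_SHIFT 12				/* assumed 4KB blocks */
#define BLOCK_SIZE  (1ULL << BLOCK_SHIFT)

/* Inclusive loop as described above: always visits pg_end. */
static unsigned long blocks_inclusive(unsigned long long offset,
				      unsigned long long len)
{
	unsigned long long pg_start = offset >> BLOCK_SHIFT;
	unsigned long long pg_end = (offset + len) >> BLOCK_SHIFT;
	unsigned long n = 0;

	assert(len > 0);
	for (unsigned long long index = pg_start; index <= pg_end; index++)
		n++;
	return n;
}

/* Same loop, but skip pg_end when the range ends exactly on its boundary. */
static unsigned long blocks_needed(unsigned long long offset,
				   unsigned long long len)
{
	unsigned long long pg_start = offset >> BLOCK_SHIFT;
	unsigned long long pg_end = (offset + len) >> BLOCK_SHIFT;
	unsigned long long off_end = (offset + len) & (BLOCK_SIZE - 1);
	unsigned long n = 0;

	assert(len > 0);
	for (unsigned long long index = pg_start; index <= pg_end; index++) {
		if (index == pg_end && off_end == 0)
			break;	/* no byte of the last page was requested */
		n++;
	}
	return n;
}
```

For offset 0 and length 20480, the inclusive loop counts blocks 0 through 5, while the guarded loop stops after block 4.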

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:08 +09:00
Jaegeuk Kim ead432756a f2fs: recover fallocated data and its i_size together
This patch arranges the f2fs_locks to cover the fallocated data and its i_size.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:08 +09:00
Jaegeuk Kim ccfb30001f f2fs: fix to report newly allocate region as extent
Previously, get_block in f2fs didn't report a newly allocated region that has
NEW_ADDR.
For readers it should not be reported, but fiemap needs it.
So, this patch introduces two get_block variants that share a core function.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:07 +09:00
Linus Torvalds 16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Al Viro 8d0207652c ->splice_write() via ->write_iter()
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter().  A bunch of simple cases converted to that...

[AV: fixed the braino spotted by Cyrill]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:18:51 -04:00
Linus Torvalds 64b2d1fbbf f2fs updates for v3.16
This patch-set includes the following major enhancement patches.
  o enhance wait_on_page_writeback
  o support SEEK_DATA and SEEK_HOLE
  o enhance readahead flows
  o enhance IO flushes
  o support fiemap
  o add some tracepoints
 
 The other bug fixes are as follows.
  o fix to support a large volume > 2TB correctly
  o recovery bug fix wrt fallocated space
  o fix recursive lock on xattr operations
  o fix some cases on the remount flow
 
 And, there are a bunch of cleanups.

Merge tag 'for-f2fs-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, there is no special interesting feature, but we've
  investigated a couple of tuning points with respect to the I/O flow.
  Several major bug fixes and a bunch of clean-ups also have been made.

  This patch-set includes the following major enhancement patches:
   - enhance wait_on_page_writeback
   - support SEEK_DATA and SEEK_HOLE
   - enhance readahead flows
   - enhance IO flushes
   - support fiemap
   - add some tracepoints

  The other bug fixes are as follows:
   - fix to support a large volume > 2TB correctly
   - recovery bug fix wrt fallocated space
   - fix recursive lock on xattr operations
   - fix some cases on the remount flow

  And, there are a bunch of cleanups"

* tag 'for-f2fs-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
  f2fs: support f2fs_fiemap
  f2fs: avoid not to call remove_dirty_inode
  f2fs: recover fallocated space
  f2fs: fix to recover data written by dio
  f2fs: large volume support
  f2fs: avoid crash when trace f2fs_submit_page_mbio event in ra_sum_pages
  f2fs: avoid overflow when large directory feathure is enabled
  f2fs: fix recursive lock by f2fs_setxattr
  MAINTAINERS: add a co-maintainer from samsung for F2FS
  MAINTAINERS: change the email address for f2fs
  f2fs: use inode_init_owner() to simplify codes
  f2fs: avoid to use slab memory in f2fs_issue_flush for efficiency
  f2fs: add a tracepoint for f2fs_read_data_page
  f2fs: add a tracepoint for f2fs_write_{meta,node,data}_pages
  f2fs: add a tracepoint for f2fs_write_{meta,node,data}_page
  f2fs: add a tracepoint for f2fs_write_end
  f2fs: add a tracepoint for f2fs_write_begin
  f2fs: fix checkpatch warning
  f2fs: deactivate inode page if the inode is evicted
  f2fs: decrease the lock granularity during write_begin
  ...
2014-06-09 19:11:44 -07:00
Jaegeuk Kim 9ab7013492 f2fs: support f2fs_fiemap
This patch links f2fs_fiemap with generic function with get_block.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-08 08:56:49 +09:00
Jaegeuk Kim 86928f984e f2fs: avoid not to call remove_dirty_inode
There is an erroneous case during recovery, as below.

In recovery_dentry,
 1) dir = f2fs_iget();
 2) mark the dir with FI_DELAY_IPUT
 3) goto unmap_out

After the end of the recovery routine, there are no dirty dentries, so the dir
cannot be released by iput in remove_dirty_dir_inode.

This patch fixes this bug by handling the iget and iput in the
recovery_dentry procedure.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-07 03:18:36 +09:00
Jaegeuk Kim 6fa1df533a f2fs: recover fallocated space
If a fallocated file is fsynced, we should recover the i_size after sudden
power cut.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-07 03:18:35 +09:00
Mel Gorman 2457aec637 mm: non-atomically mark page accessed during page cache allocation where possible
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticeable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticeable with fast storage.  The objective of the patch is to initialise
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may be
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this case are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  The
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced whereas previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
due to writeback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4    Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs     Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:10 -07:00
Jaegeuk Kim b6fe5873cb f2fs: fix to recover data written by dio
If data are overwritten through dio, previous f2fs doesn't leave the fsync mark,
because there are no additional node writes.

Note that this patch should resolve the xfstests:311.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-04 18:41:38 +09:00
Changman Lee 1dbe415216 f2fs: large volume support
f2fs's cp has one page which consists of struct f2fs_checkpoint and
version bitmap of sit and nat. To support lots of segments, we need more
blocks for sit bitmap. So let's arrange sit bitmap as following:
+-----------------+------------+
| f2fs_checkpoint | sit bitmap |
| + nat bitmap    |            |
+-----------------+------------+
0                 4k        N blocks

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: simple code change for readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-04 13:34:30 +09:00
Chao Yu bac4eef653 f2fs: avoid crash when trace f2fs_submit_page_mbio event in ra_sum_pages
Previously we allocated pages with no mapping in ra_sum_pages(), so we could
hit a crash in the f2fs_submit_page_mbio trace event, where we access the
mapping data of the page.

We'd better allocate pages in the bd_inode mapping and invalidate these pages
after we restore data from them. This avoids the crash in the above scenario.

Changes from V1
 o remove redundant code in ra_sum_pages() suggested by Jaegeuk Kim.

Call Trace:
 [<f1031630>] ? ftrace_raw_event_f2fs_write_checkpoint+0x80/0x80 [f2fs]
 [<f10377bb>] f2fs_submit_page_mbio+0x1cb/0x200 [f2fs]
 [<f103c5da>] restore_node_summary+0x13a/0x280 [f2fs]
 [<f103e22d>] build_curseg+0x2bd/0x620 [f2fs]
 [<f104043b>] build_segment_manager+0x1cb/0x920 [f2fs]
 [<f1032c85>] f2fs_fill_super+0x535/0x8e0 [f2fs]
 [<c115b66a>] mount_bdev+0x16a/0x1a0
 [<f102f63f>] f2fs_mount+0x1f/0x30 [f2fs]
 [<c115c096>] mount_fs+0x36/0x170
 [<c1173635>] vfs_kern_mount+0x55/0xe0
 [<c1175388>] do_mount+0x1e8/0x900
 [<c1175d72>] SyS_mount+0x82/0xc0
 [<c16059cc>] sysenter_do_call+0x12/0x22

Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-04 13:34:30 +09:00
Chao Yu bfec07d0f8 f2fs: avoid overflow when large directory feathure is enabled
When the large directory feature is enabled, there is one case that could cause
an overflow in dir_buckets(), as follows:
special case: level + dir_level >= 32 and level < MAX_DIR_HASH_DEPTH / 2.

Here we define MAX_DIR_BUCKETS to limit the return value when the condition
could trigger a potential overflow.
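
The clamping idea can be sketched as below. The constant values and the exact comparison are assumptions for illustration and may differ from the patch; the point is that a shift count of 32 or more on a 32-bit value is never evaluated.

```c
#include <assert.h>

/* Assumed values for this sketch. */
#define MAX_DIR_HASH_DEPTH 63
#define MAX_DIR_BUCKETS    (1U << ((MAX_DIR_HASH_DEPTH / 2) - 1))

/* Number of hash buckets at a level, capped at MAX_DIR_BUCKETS. */
static unsigned int dir_buckets(unsigned int level, unsigned int dir_level)
{
	assert(level < MAX_DIR_HASH_DEPTH);

	if (level + dir_level < MAX_DIR_HASH_DEPTH / 2)
		return 1U << (level + dir_level);
	return MAX_DIR_BUCKETS;
}
```

Without the cap, level = 20 and dir_level = 20 would ask for a shift by 40, which is undefined for a 32-bit operand in C.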

Changes from V1
 o modify description of calculation in f2fs.txt suggested by Changman Lee.

Suggested-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-04 13:34:30 +09:00
Jaegeuk Kim d631abdac9 f2fs: fix recursive lock by f2fs_setxattr
This patch should resolve the following recursive lock.

[<ffffffff8135a9c3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffa01749dc>] f2fs_setxattr+0x5c/0xa0 [f2fs]
[<ffffffffa0174c99>] __f2fs_set_acl+0x1b9/0x340 [f2fs]
[<ffffffffa017515a>] f2fs_init_acl+0x4a/0xcb [f2fs]
[<ffffffffa0159abe>] __f2fs_add_link+0x26e/0x780 [f2fs]
[<ffffffffa015d4d8>] f2fs_mkdir+0xb8/0x150 [f2fs]
[<ffffffff811cebd7>] vfs_mkdir+0xb7/0x160
[<ffffffff811cf89b>] SyS_mkdir+0xab/0xe0
[<ffffffff817244bf>] tracesys+0xe1/0xe6
[<ffffffffffffffff>] 0xffffffffffffffff

The call path indicates:
- f2fs_add_link
   : down_write(&fi->i_sem);

 - init_inode_metadata
   - f2fs_init_acl
     - __f2fs_set_acl
       - f2fs_setxattr
         : down_write(&fi->i_sem);

Here we should not call f2fs_setxattr, but __f2fs_setxattr.
But __f2fs_setxattr is a static function in xattr.c, so I found another,
generic approach that still uses f2fs_setxattr.

A page pointer is passed to f2fs_setxattr only from init_inode_metadata.
So, this patch uses that condition to avoid the recursive lock in f2fs_setxattr.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-02 22:13:16 +09:00
Chao Yu 70ff5dfeb6 f2fs: use inode_init_owner() to simplify codes
This patch uses the exported inode_init_owner() to simplify the code in
f2fs_new_inode().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-08 18:23:21 +09:00
Chao Yu adf8d90b6a f2fs: avoid to use slab memory in f2fs_issue_flush for efficiency
If we use slab memory in f2fs_issue_flush(), we will face memory pressure and
latency caused by racing kmem_cache_{alloc,free} calls.

Let's allocate the memory on the stack instead of from slab.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-08 18:23:21 +09:00
Chao Yu c20e89cde6 f2fs: add a tracepoint for f2fs_read_data_page
This patch adds a tracepoint for f2fs_read_data_page to trace when a page is
read by the user.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu e574843438 f2fs: add a tracepoint for f2fs_write_{meta,node,data}_pages
This patch adds a tracepoint for f2fs_write_{meta,node,data}_pages to trace when
pages are being fsynced/flushed.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu ecda0de343 f2fs: add a tracepoint for f2fs_write_{meta,node,data}_page
This patch adds a tracepoint for f2fs_write_{meta,node,data}_page to trace when
a page is being written out.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu dfb2bf38bf f2fs: add a tracepoint for f2fs_write_end
This patch adds a tracepoint for f2fs_write_end to trace user write operations.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Chao Yu 62aed044ea f2fs: add a tracepoint for f2fs_write_begin
This patch adds a tracepoint for f2fs_write_begin to trace user write operations.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Zhang Zhen 8b376249e7 f2fs: fix checkpatch warning
fix the following checkpatch warning:
WARNING: do {} while (0) macros should not be semicolon terminated
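
The reason for this rule shows up as soon as such a macro is used in an if/else. The macros and functions below are made up for illustration and are not the ones touched by this commit.

```c
#include <assert.h>

/* Flagged form: the macro itself ends in a semicolon. */
#define INC_BAD(x)  do { (x)++; } while (0);

/* Preferred form: the caller supplies the semicolon. */
#define INC_GOOD(x) do { (x)++; } while (0)

static int bump_if(int cond, int v)
{
	if (cond)
		INC_GOOD(v);	/* expands to exactly one statement */
	else
		v--;
	/*
	 * With INC_BAD(v); the expansion would be "do { ... } while (0);;".
	 * The extra empty statement ends the if, so the following "else"
	 * has nothing to attach to and the code fails to compile.
	 */
	return v;
}

/* Usage sketch. */
static void bump_demo(void)
{
	assert(bump_if(1, 5) == 6);
	assert(bump_if(0, 5) == 4);
}
```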

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:59 +09:00
Jaegeuk Kim 8198899b94 f2fs: deactivate inode page if the inode is evicted
If the inode page is clean during its inode eviction, it'd better drop the page
to reduce further memory pressure.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim d5f66990bb f2fs: decrease the lock granularity during write_begin
This patch reduces the lock granularity during write_begin.
When the system is under memory pressure, it would be better to reduce
the locking time for the data pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim bde446866c f2fs: no need to wait on page writebck to meta pages
This patch removes grab_cache_page_write_begin for meta pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim 9ac1349ad7 f2fs: avoid grab_cache_page_write_begin for data pages
We don't need to wait on page writeback for these cases.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Jaegeuk Kim 54b591dfda f2fs: split grab_cache_page and wait_on_page_writeback for node pages
This patch splits grab_cache_page_write_begin into grab_cache_page and
wait_on_page_writeback for node pages.

This patch intends to improve the latency of getting node pages by avoiding
unnecessary wait_on_page_writeback.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Chao Yu 8aa6f1c5bd f2fs: fix to truncate inline data in inode page when setattr
Previously we did not truncate inline data in the inode page on setattr, so the
following case could still read inline data that had already been truncated:

1. write inline data
2. ftruncate size to 0
3. ftruncate size to max inline data size
4. read from offset 0

This patch introduces truncate_inline_data() to fix this problem.
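
The four steps can be checked from userspace with a sketch like the following. The helper name stale_after_truncate() and the small sizes are assumptions for this example; whether the data is actually stored inline depends on the filesystem and mount options, which this program does not control.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical check: returns 1 if old bytes are still readable after the
 * file was truncated to 0 and extended again, 0 if the read is all zeros,
 * and -1 on error.
 */
static int stale_after_truncate(const char *path, off_t size)
{
	char buf[64] = { 0 };
	ssize_t n = 0;
	int stale = 0;
	int fd;

	assert(size > 0 && (size_t)size <= sizeof(buf));

	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	if (write(fd, "inline data", 11) != 11 ||	/* 1. write small data */
	    ftruncate(fd, 0) < 0 ||			/* 2. truncate to 0 */
	    ftruncate(fd, size) < 0 ||			/* 3. extend again */
	    (n = pread(fd, buf, (size_t)size, 0)) != size) {	/* 4. read */
		close(fd);
		return -1;
	}

	for (ssize_t i = 0; i < n; i++)
		if (buf[i] != 0)
			stale = 1;

	close(fd);
	return stale;
}
```

On a correct filesystem the extended range reads back as zeros, so the function returns 0.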

change log from v1:
 o fix a bug and do not truncate first page data after truncate inline data.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:58 +09:00
Chao Yu 817202d937 f2fs: readahead multi pages of directory for performance
We have no readahead mechanism in the ->iterate() path like the one in the
->read() path, which causes low performance when we read a large directory.
This patch adds readahead in f2fs_readdir() for better performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Chao Yu 5c1f9927ec f2fs: set errno when f2fs_iget failed in recover_dentry
We should set the error number correctly when we fail in recover_dentry(), so
that the recovery flow stops with the reason the error number shows instead of
continuing.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Jaegeuk Kim 7f7670fe9f f2fs: consider fallocated space for SEEK_DATA
If a range of space is allocated through fallocate and the user writes some
data within that space, f2fs should return the offset of the user-written data
when SEEK_DATA is requested.

For example, (N: NEW_ADDR by fallocate, X: NEW_ADDR by user)
1) fallocate 0 ~ 10MB
f -> N N N N N N N N N N N N ... N

2) write 4KB at 5MB offset
f -> N N N N N X N N N N N N ... N

3) SEEK_DATA from 0 should return 5MB offset

So, this patch adds a routine to search the first dirty page to handle that.
Then, the SEEK_DATA flow skips NEW_ADDR offsets until any dirty page is found.
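
A rough model of that flow, not the f2fs implementation: the file is a list of
block addresses, and a set records which page indexes are dirty. seek_data,
BLOCK_SIZE, NULL_ADDR and the dirty_pages set are hypothetical names chosen
for this sketch.

```python
BLOCK_SIZE = 4096   # assumed 4KB blocks, matching the example above
NULL_ADDR = "-"     # hole: nothing allocated
NEW_ADDR = "N"      # allocated, but no on-disk block yet


def seek_data(blocks, dirty_pages, offset):
    """Return the first byte offset >= offset that holds data, else None."""
    for idx in range(offset // BLOCK_SIZE, len(blocks)):
        addr = blocks[idx]
        if addr == NULL_ADDR:
            continue                      # hole
        if addr == NEW_ADDR and idx not in dirty_pages:
            continue                      # preallocated only, no user data
        return max(offset, idx * BLOCK_SIZE)
    return None


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = [NEW_ADDR] * (10 * mb // BLOCK_SIZE)   # 1) fallocate 0 ~ 10MB
    dirty = {5 * mb // BLOCK_SIZE}                  # 2) write 4KB at 5MB
    print(seek_data(blocks, dirty, 0) == 5 * mb)    # 3) SEEK_DATA from 0
```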

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Jaegeuk Kim fe369bc8ba f2fs: return i_size if the hole is outside of i_size
When SEEK_HOLE is requested, it should return i_size if the hole position is
found outside of i_size.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Chao Yu 267378d4de f2fs: introduce f2fs_seek_block to support SEEK_{DATA, HOLE} in llseek
In this patch we introduce f2fs_seek_block to support SEEK_{DATA,HOLE} of
lseek(2).

change log from v1:
 o fix bug when lseek from middle of page and fix wrong calculation of
PGOFS_OF_NEXT_DNODE macro.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Gu Zheng 2163d19815 f2fs: introduce helper function {create,destroy}_flush_cmd_control
Introduce helper functions {create,destroy}_flush_cmd_control to clean up
the create/destroy flush merge operations.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:57 +09:00
Gu Zheng a688b9d9e5 f2fs: introduce struct flush_cmd_control to wrap the flush_merge fields
Split the flush_merge fields from sm_i, and use the new struct flush_cmd_control
to wrap them, so that we can ignore these fields if flush_merge is disabled, and
it also makes the structs neater.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Chao Yu 6403eb1f64 f2fs: introduce helper macro ADDRS_PER_PAGE()
Introduce helper macro ADDRS_PER_PAGE() to get the number of address pointers in
a direct node or an inode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim 2aea39eca6 f2fs: submit bio at the reclaim path
If f2fs_write_data_page is called through the reclaim path, we should submit
the bio right away.

This patch resolves the following issue that Marc Dietrich reported.
"It took me a while to bisect a problem which causes my ARM (tegra2) netbook to
frequently stall for 5-10 seconds when I enable EXA acceleration (opentegra
experimental ddx)."
And this patch fixes that.

Reported-by: Marc Dietrich <marvin24@gmx.de>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim 916decbf39 f2fs: return errors right after checking them
This patch adds two error conditions early in the setxattr operations.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim c02745ef68 f2fs: pass flags field to setxattr functions
This patch passes the "flags" field to the low level setxattr functions
to use XATTR_REPLACE in the following patches.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim e112326805 f2fs: clean up long variable names
This patch includes simple clean-ups to shorten unnecessarily long variable names.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Chao Yu 454ae7e519 f2fs: handle inline data independently in f2fs_bmap
We'd better handle inline data case independently in f2fs_bmap().
It can reduce our handling time in f2fs_bmap().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:56 +09:00
Jaegeuk Kim 6fb03f3a40 f2fs: adjust free mem size to flush dentry blocks
If too many dirty dentry blocks are cached without reaching the flush condition,
we can fall into a livelock in balance_dirty_pages.
So, let's consider the mem size for the condition.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Jaegeuk Kim e8271fa390 f2fs: avoid BUG_ON when mounting corrupted image having garbage blocks
If the disk has some garbage blocks, F2FS can hit BUG_ON when
recovering direct node blocks.
This patch detects the error case and avoids that prior to reaching BUG_ON.

Alexey Khoroshilov addressed the potential security issues as follows.
"An ability to trigger a BUG_ON assert by mounting a crafted image is
usually considered as a local denial of service [1-3]. As far as I
understand, the reason is that some kernel data may become inconsistent
that can lead to further problems.

[1] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-3353
[2] http://www.openwall.com/lists/oss-security/2011/06/24/4
[3] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2928
etc."

Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Jaegeuk Kim 7ee0eeabcd f2fs: add available_nids to fix handling max_nid correctly
This patch introduces available_nids for alloc_nids() and fixes max_nid for
build_free_nids() and scan_nat_pages().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Fabian Frederick b49ad51e6d f2fs: add static to get_max_meta_blks
inline get_max_meta_blks is only used in checkpoint.c
Use standard static inline format.

Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Chao Yu 94dac22e72 f2fs: introduce raw_nat_from_node_info() to simplify codes
This patch introduces raw_nat_from_node_info() to simplify some code, and also
uses the existing function node_info_from_raw_nat() to do the same job.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Gu Zheng 876dc59eb1 f2fs: add the flush_merge handle in the remount flow
Add *remount* handling for the flush_merge option, so that users
can enable flush_merge at runtime, for example when the underlying device
handles the cache_flush command relatively slowly.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:55 +09:00
Zhang Zhen 8abfb36ab3 f2fs: atomically set inode->i_flags in f2fs_set_inode_flags()
Use set_mask_bits() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
FS_IMMUTABLE_FL, FS_APPEND_FL, etc. flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.
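
The race can be pictured with a small model. This is a sketch only: the flag
values are illustrative, the two helpers are hypothetical names, and each
returns the sequence of i_flags values a concurrent reader could observe.

```python
# Illustrative bit values; only the relationship between them matters here.
S_APPEND = 0x4
S_IMMUTABLE = 0x8
FLAG_MASK = S_APPEND | S_IMMUTABLE


def visible_states_clear_then_set(i_flags, new_bits, mask=FLAG_MASK):
    """Old pattern: clear, then set. Two stores, so two visible values."""
    cleared = i_flags & ~mask
    return [cleared, cleared | new_bits]


def visible_states_mask_bits(i_flags, new_bits, mask=FLAG_MASK):
    """set_mask_bits()-style update: one store of (old & ~mask) | new_bits."""
    return [(i_flags & ~mask) | new_bits]


if __name__ == "__main__":
    old = S_IMMUTABLE  # an immutable file whose flags are being refreshed
    print([hex(s) for s in visible_states_clear_then_set(old, S_IMMUTABLE)])
    print([hex(s) for s in visible_states_mask_bits(old, S_IMMUTABLE)])
```

In the first sequence there is a value without S_IMMUTABLE; in the second the
flag is never observed as cleared.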

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jingoo Han b156d54241 f2fs: make recover_inline_xattr() static
Make recover_inline_xattr() static, because this function is
used only in this file.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim ed57c27f73 f2fs: remove costly dirty_dir_inode operations
This patch removes list operations in handling dirty dir inodes.
Previously, F2FS traversed the whole list of dirty dir inodes to check whether
there is an existing inode or not, resulting in heavy CPU overheads.

So this patch removes such traversals by adding FI_DIRTY_DIR to
indicate whether the inode lies on the list or not.
Through this simple flag, we can remove redundant operations gracefully.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim 15c6e3aae6 f2fs: fix to unlock f2fs_lock at the omitted error case
If an error occurs, we should call f2fs_unlock_op.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim 76f60268e7 f2fs: call redirty_page_for_writepage
This patch replaces some general code with redirty_page_for_writepage, which
can be enabled after consideration on additional procedure like counting dirty
pages appropriately.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Jaegeuk Kim 1e87a78d95 f2fs: avoid to conduct roll-forward due to the remained garbage blocks
The f2fs always scans the next chain of direct node blocks.
But some garbage blocks can remain due to no discard support or
SSR triggers.
This occasionally results in recovering wrong inodes that were used, or in
BUG_ONs due to reallocating node ids, as follows.

When mounting this f2fs image:
http://linuxtesting.org/downloads/f2fs_fault_image.zip
BUG_ON is triggered in f2fs driver (messages below are generated on
kernel 3.13.2; for other kernels output is similar):

kernel BUG at fs/f2fs/node.c:215!
 Call Trace:
 [<ffffffffa032ebad>] recover_inode_page+0x1fd/0x3e0 [f2fs]
 [<ffffffff811446e7>] ? __lock_page+0x67/0x70
 [<ffffffff81089990>] ? autoremove_wake_function+0x50/0x50
 [<ffffffffa0337788>] recover_fsync_data+0x1398/0x15d0 [f2fs]
 [<ffffffff812b9e5c>] ? selinux_d_instantiate+0x1c/0x20
 [<ffffffff811cb20b>] ? d_instantiate+0x5b/0x80
 [<ffffffffa0321044>] f2fs_fill_super+0xb04/0xbf0 [f2fs]
 [<ffffffff811b861e>] ? mount_bdev+0x7e/0x210
 [<ffffffff811b8769>] mount_bdev+0x1c9/0x210
 [<ffffffffa0320540>] ? validate_superblock+0x210/0x210 [f2fs]
 [<ffffffffa031cf8d>] f2fs_mount+0x1d/0x30 [f2fs]
 [<ffffffff811b9497>] mount_fs+0x47/0x1c0
 [<ffffffff81166e00>] ? __alloc_percpu+0x10/0x20
 [<ffffffff811d4032>] vfs_kern_mount+0x72/0x110
 [<ffffffff811d6763>] do_mount+0x493/0x910
 [<ffffffff811615cb>] ? strndup_user+0x5b/0x80
 [<ffffffff811d6c70>] SyS_mount+0x90/0xe0
 [<ffffffff8166f8d9>] system_call_fastpath+0x16/0x1b

Found by Linux File System Verification project (linuxtesting.org).

Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Gu Zheng b270ad6f0a f2fs: enable flush_merge only in f2fs is not read-only
Enable flush_merge only if f2fs is not read-only, and show the mount
option only in that case.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:54 +09:00
Gu Zheng 197d46476c f2fs: use __GFP_ZERO to avoid appending set-NULL
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:53 +09:00
Gu Zheng a4ed23f2f1 f2fs: put the bio when issue_flush completed
Put the bio once the flush cmd has been issued; this also fixes the following
kmemleak:
unreferenced object 0xffff8800270c73c0 (size 200):
  comm "f2fs_flush-7:0", pid 27161, jiffies 4312127988 (age 988.503s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 40 07 81 19 01 88 ff ff  ........@.......
    01 00 00 00 00 00 00 f0 11 14 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81559866>] kmemleak_alloc+0x72/0x96
    [<ffffffff81156f7e>] slab_post_alloc_hook+0x28/0x2a
    [<ffffffff811595b1>] kmem_cache_alloc+0xec/0x157
    [<ffffffff8111924d>] mempool_alloc_slab+0x15/0x17
    [<ffffffff81119513>] mempool_alloc+0x71/0x138
    [<ffffffff81193548>] bio_alloc_bioset+0x93/0x18c
    [<ffffffffa040f857>] issue_flush_thread+0x8d/0x145 [f2fs]
    [<ffffffff8107ac16>] kthread+0xba/0xc2
    [<ffffffff81571b2c>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-05-07 10:21:53 +09:00
Al Viro 8174202b34 write_iter variants of {__,}generic_file_aio_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:38:00 -04:00
Al Viro aad4f8bb42 switch simple generic_file_aio_read() users to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:55 -04:00
Al Viro 5b46f25ddc f2fs: switch to iov_iter_alignment()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:52 -04:00
Al Viro 31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Linus Torvalds 26c12d9334 Merge branch 'akpm' (incoming from Andrew)
Merge second patch-bomb from Andrew Morton:
 - the rest of MM
 - zram updates
 - zswap updates
 - exit
 - procfs
 - exec
 - wait
 - crash dump
 - lib/idr
 - rapidio
 - adfs, affs, bfs, ufs
 - cris
 - Kconfig things
 - initramfs
 - small amount of IPC material
 - percpu enhancements
 - early ioremap support
 - various other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
  MAINTAINERS: update Intel C600 SAS driver maintainers
  fs/ufs: remove unused ufs_super_block_third pointer
  fs/ufs: remove unused ufs_super_block_second pointer
  fs/ufs: remove unused ufs_super_block_first pointer
  fs/ufs/super.c: add __init to init_inodecache()
  doc/kernel-parameters.txt: add early_ioremap_debug
  arm64: add early_ioremap support
  arm64: initialize pgprot info earlier in boot
  x86: use generic early_ioremap
  mm: create generic early_ioremap() support
  x86/mm: sparse warning fix for early_memremap
  lglock: map to spinlock when !CONFIG_SMP
  percpu: add preemption checks to __this_cpu ops
  vmstat: use raw_cpu_ops to avoid false positives on preemption checks
  slub: use raw_cpu_inc for incrementing statistics
  net: replace __this_cpu_inc in route.c with raw_cpu_inc
  modules: use raw_cpu_write for initialization of per cpu refcount.
  mm: use raw_cpu ops for determining current NUMA node
  percpu: add raw_cpu_ops
  slub: fix leak of 'name' in sysfs_slab_add
  ...
2014-04-07 16:38:06 -07:00
Kirill A. Shutemov f1820361f8 mm: implement ->map_pages for page cache
filemap_map_pages() is a generic implementation of ->map_pages() for
filesystems that use the page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if the
filesystem uses filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
Linus Torvalds 3021112598 f2fs updates for v3.15
Merge tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches.
   - introduce large directory support
   - introduce f2fs_issue_flush to merge redundant flush commands
   - merge write IOs as much as possible aligned to the segment
   - add sysfs entries to tune the f2fs configuration
   - use radix_tree for the free_nid_list to reduce in-memory operations
   - remove costly bit operations in f2fs_find_entry
   - enhance the readahead flow for CP/NAT/SIT/SSA blocks

  The other bug fixes are as follows:
   - recover xattr node blocks correctly after sudden-power-cut
   - fix to calculate the maximum number of node ids
   - enhance to handle many error cases

  And, there are a bunch of cleanups"

* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
  f2fs: fix wrong statistics of inline data
  f2fs: check the acl's validity before setting
  f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
  f2fs: fix to cover io->bio with io_rwsem
  f2fs: fix error path when fail to read inline data
  f2fs: use list_for_each_entry{_safe} for simplifying code
  f2fs: avoid free slab cache under spinlock
  f2fs: avoid unneeded lookup when xattr name length is too long
  f2fs: avoid unnecessary bio submit when wait page writeback
  f2fs: return -EIO when node id is not matched
  f2fs: avoid RECLAIM_FS-ON-W warning
  f2fs: skip unnecessary node writes during fsync
  f2fs: introduce fi->i_sem to protect fi's info
  f2fs: change reclaim rate in percentage
  f2fs: add missing documentation for dir_level
  f2fs: remove unnecessary threshold
  f2fs: throttle the memory footprint with a sysfs entry
  f2fs: avoid to drop nat entries due to the negative nr_shrink
  f2fs: call f2fs_wait_on_page_writeback instead of native function
  f2fs: introduce nr_pages_to_write for segment alignment
  ...
2014-04-07 10:55:36 -07:00
Chao Yu 48b230a583 f2fs: fix wrong statistics of inline data
If we remove a file that has inline data after mount, our statistics become
inaccurate.

cat /sys/kernel/debug/f2fs/status
  - Inline_data Inode: 4294967295

Let's add stat_inc_inline_inode() to stat inline info of the file when lookup.

Change log from v1:
 o stat in f2fs_lookup() instead of in do_read_inode() for excluding wrong stat.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-07 12:40:58 +09:00
ZhangZhen 3a8861e271 f2fs: check the acl's validity before setting
Before setting the acl, call posix_acl_valid() to check if it is
valid or not.

Signed-off-by: zhangzhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-07 12:18:30 +09:00
Jaegeuk Kim 6b4afdd794 f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
Some storage devices show relatively high latencies to complete cache_flush
commands, even though their normal IO speed is quite high. In such
a case, cache_flush commands need to be merged as much as possible to avoid
issuing them redundantly.
So, this patch introduces a mount option, "-o flush_merge", to mitigate such
overhead.
If this option is enabled by user, F2FS merges the cache_flush commands and then
issues just one cache_flush on behalf of them. Once the single command is
finished, F2FS sends a completion signal to all the pending threads.
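
The merging idea can be sketched in a few lines. This is a single-threaded
model, not the f2fs_issue_flush code: FlushMerger, request and run_once are
hypothetical names, and issue_flush stands for whatever sends one cache_flush
to the device.

```python
class FlushMerger:
    """Collect flush requests and serve a whole batch with a single flush."""

    def __init__(self, issue_flush):
        self._issue_flush = issue_flush  # callable that issues one cache_flush
        self._pending = []               # completion callbacks of the waiters

    def request(self, on_complete):
        """Queue a flush request; on_complete(ret) is called when it is done."""
        self._pending.append(on_complete)

    def run_once(self):
        """Issue one flush for everything queued; return how many were served."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        ret = self._issue_flush()        # one command on behalf of the batch
        for on_complete in batch:
            on_complete(ret)             # signal every pending requester
        return len(batch)


if __name__ == "__main__":
    issued = []
    merger = FlushMerger(lambda: issued.append("flush") or 0)
    for tid in range(3):
        merger.request(lambda ret, tid=tid: print("fsync", tid, "done, ret", ret))
    merger.run_once()
    print("cache_flush commands issued:", len(issued))
```

Three concurrent fsync requests are completed by one cache_flush here; the
benefit grows with the number of requests that pile up while a flush is in
flight.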

Note that, this option can be used under a workload consisting of very intensive
concurrent fsync calls, while the storage handles cache_flush commands slowly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-07 09:50:58 +09:00
Linus Torvalds 24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: decouple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystem() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Johannes Weiner 91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Jaegeuk Kim ce23447fe5 f2fs: fix to cover io->bio with io_rwsem
In the f2fs_wait_on_page_writeback, io->bio should be covered by io_rwsem.
Otherwise, the bio pointer can become a dangling pointer due to data races.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:27 +09:00
Chao Yu d54c795b49 f2fs: fix error path when fail to read inline data
We should unlock page in ->readpage() path and also should unlock & release page
in error path of ->write_begin() to avoid deadlock or memory leak.
So let's add release code to fix the problem when we fail to read inline data.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:27 +09:00
Chao Yu 2d7b822ad9 f2fs: use list_for_each_entry{_safe} for simplifying code
This patch uses list_for_each_entry{_safe} instead of list_for_each{_safe} to
simplify the code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:27 +09:00
Chao Yu cf0ee0f09b f2fs: avoid free slab cache under spinlock
Move kmem_cache_free out of spinlock protection region for better performance.

Change log from v1:
 o remove spinlock protection for kmem_cache_free in destroy_node_manager
suggested by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-02 09:56:12 +09:00
Chao Yu 6e452d69d4 f2fs: avoid unneeded lookup when xattr name length is too long
f2fs_setxattr already limits the attribute name length, so we should also check
it in f2fs_getxattr to avoid a useless lookup caused by an invalid name length.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-01 18:54:24 +09:00
Chao Yu df0f8dc0e1 f2fs: avoid unnecessary bio submit when wait page writeback
This patch introduces is_merged_page() to check whether the current page is
merged in the f2fs bio cache. When the page is not in the cache, we can avoid
submitting the bio cache, which leaves more chances to merge pages.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-01 18:53:41 +09:00
Jaegeuk Kim 3bb5e2c8fe f2fs: return -EIO when node id is not matched
During the cleaning of node segments, F2FS can get errored node blocks due to
a data race between the node page lock and its valid bitmap operations.
In that case, it needs to return an error to skip copying such an obsolete block.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-04-01 17:38:26 +09:00
Jaegeuk Kim 808a1d7490 f2fs: avoid RECLAIM_FS-ON-W warning
This patch should resolve the following possible bug.

RECLAIM_FS-ON-W at:
 mark_held_locks+0xb9/0x140
 lockdep_trace_alloc+0x85/0xf0
 __kmalloc+0x53/0x1d0
 read_all_xattrs+0x3d1/0x3f0 [f2fs]
 f2fs_getxattr+0x4f/0x100 [f2fs]
 f2fs_get_acl+0x4c/0x290 [f2fs]
 get_acl+0x4f/0x80
 posix_acl_create+0x72/0x180
 f2fs_init_acl+0x29/0xcc [f2fs]
 __f2fs_add_link+0x259/0x710 [f2fs]
 f2fs_create+0xad/0x1c0 [f2fs]
 vfs_create+0xed/0x150
 do_last+0xd36/0xed0
 path_openat+0xc5/0x680
 do_filp_open+0x43/0xa0
 do_sys_open+0x13c/0x230
 SyS_creat+0x1e/0x20
 system_call_fastpath+0x16/0x1b

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:21:08 +09:00
Jaegeuk Kim 479f40c44a f2fs: skip unnecessary node writes during fsync
If multiple redundant fsync calls are triggered, we don't need to write its
node pages with fsync mark continuously.

So, this patch adds FI_NEED_FSYNC to track whether the latest node block is
written with the fsync mark or not.
If the mark was set, a new fsync doesn't need to write a node block.
Otherwise, we should do a new node block with the mark for roll-forward
recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:11 +09:00
Jaegeuk Kim d928bfbfe7 f2fs: introduce fi->i_sem to protect fi's info
This patch introduces fi->i_sem to protect fi's info that includes xattr_ver,
pino, i_nlink.
This makes it possible to remove i_mutex during f2fs_sync_file, resulting in a
performance improvement when a number of fsync calls are triggered from many
concurrent threads.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:11 +09:00
Jaegeuk Kim 58c410351e f2fs: change reclaim rate in percentage
It is more reasonable to determine the reclaiming rate of prefree segments
according to the volume size, which is set to 5% by default.
For example, if the volume is 128GB, the prefree segments are reclaimed
when their total size reaches 6.4GB.
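
The arithmetic behind that example, as a sketch in bytes rather than in
segments (prefree_reclaim_threshold and GB are hypothetical names; GB is taken
as 10**9 here to keep the numbers round):

```python
GB = 10 ** 9  # assumption: decimal gigabytes, only to keep the example round


def prefree_reclaim_threshold(volume_bytes, rate_percent=5):
    """Amount of prefree space at which reclaim starts, for a given rate."""
    return volume_bytes * rate_percent // 100


if __name__ == "__main__":
    threshold = prefree_reclaim_threshold(128 * GB)
    print(threshold / GB, "GB")
```

Because the rate is a percentage, the threshold scales with the volume: a
256GB volume with the same 5% rate would reclaim at 12.8GB.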

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:10 +09:00
Jaegeuk Kim a5f420101d f2fs: remove unnecessary threshold
The NM_WOUT_THRESHOLD is now obsolete since f2fs starts to control on a basis
of the memory footprint.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:09 +09:00
Jaegeuk Kim cdfc41c134 f2fs: throttle the memory footprint with a sysfs entry
This patch introduces ram_thresh, a sysfs entry, which controls the memory
footprint used by the free nid list and the nat cache.

Previously, the free nid list was controlled by MAX_FREE_NIDS, while the nat
cache was managed by NM_WOUT_THRESHOLD.
However, this approach cannot be applied dynamically according to the system.

So, this patch adds ram_thresh, with which users can specify the threshold
in units of 1 / 1024 of the total memory.
For example, if the total ram size is 4GB and the value is set to 10 by default,
f2fs tries to control the number of free nids and nat caches not to consume over
10 * (4GB / 1024) = 40MB.
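
Worked through as code, taking only the 1 / 1024 formula stated above (the
function name and the byte-based units are assumptions of this sketch, not the
kernel implementation):

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def nid_nat_memory_limit(total_ram_bytes, ram_thresh=10):
    """Memory budget for the free nid list and nat cache: thresh/1024 of RAM."""
    return ram_thresh * (total_ram_bytes // 1024)


if __name__ == "__main__":
    limit = nid_nat_memory_limit(4 * GIB)
    print(limit // MIB, "MB")
```

With 4GB of RAM, one unit is 4GB / 1024 = 4MB, so the default of 10 gives
10 * 4MB = 40MB.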

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:09 +09:00
Jaegeuk Kim 40bb0058c8 f2fs: avoid to drop nat entries due to the negative nr_shrink
The try_to_free_nats should not receive the negative nr_shrink.
Otherwise, it can drop all the nat entries by the while loop.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:08 +09:00
Jaegeuk Kim 3cb5ad152b f2fs: call f2fs_wait_on_page_writeback instead of native function
If a page is under writeback, f2fs can face a deadlock in writepages.
This is caused by merging IOs inside f2fs, so when this is detected, let's submit
the merged IOs, which is implemented by f2fs_wait_on_page_writeback.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-20 22:10:04 +09:00
Jaegeuk Kim 50c8cdb35a f2fs: introduce nr_pages_to_write for segment alignment
This patch introduces nr_pages_to_write to align page writes to the segment
or other operational unit size, which can be tuned according to the system
environment.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 16:37:53 +09:00
Jaegeuk Kim d3baf95da5 f2fs: increase pages_skipped when skipping writepages
This patch increases pages_skipped when skipping writepages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 16:37:16 +09:00
Jaegeuk Kim 87d6f89094 f2fs: avoid small data writes by skipping writepages
This patch introduces nr_pages_to_skip(sbi, type) to determine whether writepages
can be skipped.
The dentry, node, and meta pages can be controlled by F2FS without breaking the
FS consistency.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 13:58:59 +09:00
Jaegeuk Kim f8b2c1f940 f2fs: introduce get_dirty_dents for readability
The get_dirty_dents gives us the number of dirty dentry pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 12:34:30 +09:00
Chao Yu 04c0938844 f2fs: fix incorrect parsing with option string
Previously, 'background_gc={on***,off***}' was parsed as a correct option;
this patch fixes that trivial bug in the mount process.

Change log from v1:
 o need to check the length of the parameter, as suggested by Jaegeuk Kim.
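
The difference between prefix matching and an exact match can be sketched in
Python as follows. Both function names are hypothetical, and the loose variant
only models the kind of comparison that would accept "on***"; it is not the
kernel's parser.

```python
def parse_background_gc_loose(value: str) -> bool:
    """Prefix-only comparison: accepts 'on***' and 'off***' by mistake."""
    if value.startswith("on"):
        return True
    if value.startswith("off"):
        return False
    raise ValueError(f"invalid background_gc value: {value!r}")


def parse_background_gc_strict(value: str) -> bool:
    """Exact comparison: the length of the parameter is checked as well."""
    if value == "on":
        return True
    if value == "off":
        return False
    raise ValueError(f"invalid background_gc value: {value!r}")
```

With the strict variant, "on" and "off" still parse, while "onxyz" is rejected.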

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 10:13:02 +09:00
Chao Yu e4fc5fbfc9 f2fs: avoid to return incorrect errno of read_normal_summaries
We should return the error number from read_normal_summaries instead of -EINVAL
when read_normal_summaries fails.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 09:29:53 +09:00
Chao Yu 4bc8e9bcf5 f2fs: introduce f2fs_has_xattr_block for better readability
This patch introduces a helper function f2fs_has_xattr_block for better
readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 09:29:46 +09:00
Chao Yu 90aa6dc9b9 f2fs: print type for each segment in segment_info's show
The original segment_info's show looks out-of-format:
cat /proc/fs/f2fs/loop0/segment_info
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 512
512 512 512 512 512 512 512 0 0 512
348 0 263 0 0 512 0 0 512 512
512 512 0 512 512 512 512 512 512 512
512 512 511 328 512 512 512 512 512 512
512 512 512 512 512 512 512 0 0 175

Let's fix this and show type for each segment.
cat /proc/fs/f2fs/loop0/segment_info
format: segment_type|valid_blocks
segment_type(0:HD, 1:WD, 2:CD, 3:HN, 4:WN, 5:CN)
0    2|0   1|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0
10   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0
20   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0
30   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0
40   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0   0|0
50   3|0   3|0   3|0   3|0   3|0   3|0   3|0   0|0   3|0   3|0
60   3|0   3|0   3|0   3|0   3|0   3|0   3|0   3|0   3|0   3|512
70   3|512 3|512 3|512 3|512 3|512 3|512 3|512 3|0   3|0   3|512
80   3|0   3|0   3|0   3|0   3|0   3|512 3|0   3|0   3|512 3|512
90   3|512 0|512 3|274 0|512 0|512 0|512 0|512 0|512 0|512 3|512
100  3|512 0|512 3|511 0|328 3|512 0|512 0|512 3|512 0|512 0|512
110  0|512 0|512 0|512 0|512 0|512 0|512 0|512 5|0   4|0   3|512
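
For tooling that consumes the new layout, a minimal Python parser might look
like the sketch below. It assumes each data row starts with the number of its
first segment followed by type|valid_blocks cells, as in the sample above; the
name parse_segment_info is hypothetical.

```python
# Type abbreviations exactly as printed in the segment_info header.
SEGMENT_TYPES = {0: "HD", 1: "WD", 2: "CD", 3: "HN", 4: "WN", 5: "CN"}


def parse_segment_info(text: str) -> dict:
    """Map segment number -> (type abbreviation, valid block count)."""
    segments = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the two header lines, which do not start
        # with a segment number.
        if not fields or not fields[0].isdigit():
            continue
        base = int(fields[0])
        for offset, cell in enumerate(fields[1:]):
            seg_type, valid_blocks = cell.split("|")
            segments[base + offset] = (SEGMENT_TYPES[int(seg_type)],
                                       int(valid_blocks))
    return segments
```

Header lines are skipped because their first field is not a number, so the
whole file content can be passed in unchanged.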

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-18 09:27:18 +09:00
Theodore Ts'o 02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o remount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
not documented or guaranteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).
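
The policy suggested above (sync only on a read-write to read-only
transition) can be written down as a small Python predicate. This is a sketch
of the decision only, not of any file system's remount_fs(); the function name
is hypothetical and MS_RDONLY is defined locally with the Linux flag value.

```python
MS_RDONLY = 1  # read-only mount flag value on Linux


def needs_sync_on_remount(currently_readonly: bool, new_flags: int) -> bool:
    """Return True only when a read-write mount is being made read-only."""
    going_readonly = bool(new_flags & MS_RDONLY)
    return (not currently_readonly) and going_readonly
```

A no-op remount of a read-write file system yields False here, which is the
case the message describes as not worth an unconditional syncfs().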

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Chao Yu 910bb12d29 f2fs: check upper bound of ino value in f2fs_nfs_get_inode
An upper-bound check of ino should be added to f2fs_nfs_get_inode, so that the
unneeded processing before do_read_inode in f2fs_iget can be avoided when ino is
invalid.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-12 18:15:38 +09:00
Chao Yu 987c7c3112 f2fs: introduce f2fs_has_inline_xattr for better readability
This patch introduces a helper function f2fs_has_inline_xattr for better
readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-12 17:23:35 +09:00
Chao Yu 28cdce0459 f2fs: recover inline xattr data in roll-forward process
Previously we do not recover inline xattr data of inode after power-cut, so
inline xattr data may be lost.
We should recover the data during the roll-forward process.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-11 16:31:06 +09:00
Gu Zheng d653788a43 f2fs: optimize restore_node_summary slightly
Previously, we used ra_sum_pages to pre-read as many contiguous pages as
possible, and if we failed to allocate more pages, an ENOMEM error
was reported upstream, even though we had already allocated some pages.
In fact, we can use the available pages to do part of the job,
and continue with the rest in the following round. ENOMEM is reported
upstream only if we really cannot allocate any page.

Another fix is to skip processing the following pages if an
EIO occurs when reading a page from page_list.
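
The partial-allocation idea can be modelled in Python as below. The sketch
assumes an allocator callback that returns None on failure; alloc_batch,
process_all, and flaky_allocator are hypothetical names, and MemoryError
stands in for ENOMEM.

```python
import itertools


def alloc_batch(wanted: int, try_alloc) -> list:
    """Allocate up to `wanted` pages; fail only if none could be obtained."""
    pages = []
    for _ in range(wanted):
        page = try_alloc()
        if page is None:
            break  # work with what we already have
        pages.append(page)
    if not pages:
        raise MemoryError("no page could be allocated")
    return pages


def process_all(total: int, batch_size: int, try_alloc) -> int:
    """Handle `total` pages in rounds; return the number of rounds used."""
    done = 0
    rounds = 0
    while done < total:
        pages = alloc_batch(min(batch_size, total - done), try_alloc)
        done += len(pages)  # the rest is continued in the next round
        rounds += 1
    return rounds


def flaky_allocator(fail_every: int):
    """Test allocator that fails on every `fail_every`-th call."""
    calls = itertools.count(1)

    def try_alloc():
        return None if next(calls) % fail_every == 0 else object()

    return try_alloc
```

A short batch is treated as progress rather than as an error, so the job still
completes, only in more rounds.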

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: modify the flow for better neat code]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-10 18:45:15 +09:00
Gu Zheng 46c04366bb f2fs: format segment_info's show for better legibility
The original segment_info's show is a bit out-of-format:

[root@guz Demoes]# cat /proc/fs/f2fs/loop0/segment_info
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
......
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 1 [root@guz Demoes]#

so we fix it here for better legibility.
[root@guz Demoes]# cat /proc/fs/f2fs/loop0/segment_info
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
......
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 1
[root@guz Demoes]#

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-10 18:45:15 +09:00
Gu Zheng e8512d2e0c f2fs: remove the unused ctor argument of f2fs_kmem_cache_create()
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-10 18:45:14 +09:00
Gu Zheng b6ce391e61 f2fs: update start nid only once each circle
Integrated a couple of minor changes for better readability suggested by
Chao Yu.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-10 18:45:09 +09:00
Jaegeuk Kim 20f70751c6 f2fs: fix wrong kernel coding style
This patch includes a simple fix to adjust coding style.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-05 10:48:53 +09:00
Jaegeuk Kim c81bf1c84f f2fs: fix to write node pages with WRITE_SYNC
This patch fixes performance regression of dbench reported by
Alex <hbx7d@yandex.com>.

This issue was revealed by Phoronix tests results:
http://www.phoronix.com/scan.php?page=article&item=linux_314_ssdfs&num=2

It turns out that we need to assign WRITE_SYNC to the node writes, if
fsync is triggered.

The performance numbers are like below, which is measured by Alex.
1. 355MB/s       ext4
2. 225MB/s       f2fs : WRITE for node writes
3. 525MB/s       f2fs : WRITE_SYNC for node writes

Reported-And-Tested-by: Alex <hbx7d@yandex.com>.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-03-03 11:28:40 +09:00
Chao Yu 9cf3c3898a f2fs: fix dirty page accounting when redirty
We should de-account dirty counters for page when redirty in ->writepage().

Wu Fengguang described in 'commit 971767caf632190f77a40b4011c19948232eed75':
"writeback: fix dirtied pages accounting on redirty
De-account the accumulative dirty counters on page redirty.

Page redirties (very common in ext4) will introduce mismatch between
counters (a) and (b)

a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN

This will introduce systematic errors in balanced_rate and result in
dirty page position errors (ie. the dirty pages are no longer balanced
around the global/bdi setpoints)."

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-28 13:09:08 +09:00
Chao Yu 695fd1ed3b f2fs: use existing macro to clean up some codes
This patch uses the existing macros F2FS_INODE/NEXT_FREE_BLKADDR to clean up some
code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-27 21:09:28 +09:00
Chao Yu 81c1a0f13e f2fs: readahead contiguous SSA blocks for f2fs_gc
If there are multiple segments in one section, f2fs_gc reads their SSA blocks,
which have contiguous addresses, one by one. This may hurt performance, so let's
read ahead the SSA blocks by merging the multiple read requests.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-27 20:40:36 +09:00
Jaegeuk Kim ab9fa662e4 f2fs: add an sysfs entry to control the directory level
This patch adds a sysfs entry to control dir_level used by the large directory.

The description of this entry is:

 dir_level                    This parameter controls the directory level to
			      support large directory. If a directory has a
			      number of files, it can reduce the file lookup
			      latency by increasing this dir_level value.
			      Otherwise, it needs to decrease this value to
			      reduce the space overhead. The default value is 0.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-27 20:31:15 +09:00
Jaegeuk Kim 3843154598 f2fs: introduce large directory support
This patch introduces an i_dir_level field to support large directory.

Previously, f2fs maintained multi-level hash tables to find a dentry quickly
from a bunch of child dentries in a directory, and the hash tables form
the tree structure shown below.

In Documentation/filesystems/f2fs.txt,

----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------

level #0   | A(2B)
           |
level #1   | A(2B) - A(2B)
           |
level #2   | A(2B) - A(2B) - A(2B) - A(2B)
     .     |   .       .       .       .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
     .     |   .       .       .       .
level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)

But, if we can guess that a directory will handle a large number of child files,
we don't need to traverse the tree from level #0 to #N all the time.
Since the lower level tables contain a relatively small number of dentries,
the miss ratio of the target dentry is likely to be high.

In order to avoid that, we can configure the hash tables sparsely from level #0
like this.

level #0   | A(2B) - A(2B) - A(2B) - A(2B)

level #1   | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
     .     |   .       .       .       .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
     .     |   .       .       .       .
level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)

With this structure, we can skip the ineffective tree searches in lower level
hash tables.

This patch adds just a facility for this by introducing i_dir_level in
f2fs_inode.
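
The two layouts drawn above suggest that the bucket count doubles per level
and that i_dir_level shifts the starting point. A Python sketch under that
reading follows; the function name, the MAX_DIR_HASH_DEPTH value, and the cap
on the bucket count are illustrative assumptions, not taken from this patch.

```python
MAX_DIR_HASH_DEPTH = 63  # assumed value, for illustration only


def dir_buckets(level: int, dir_level: int = 0) -> int:
    """Number of buckets at `level` when the tables start `dir_level` deep.

    Assumption: buckets double per level until half of the maximum depth,
    after which the count stays at a fixed cap.
    """
    if level + dir_level < MAX_DIR_HASH_DEPTH // 2:
        return 1 << (level + dir_level)
    return 1 << (MAX_DIR_HASH_DEPTH // 2 - 1)


if __name__ == "__main__":
    for dir_level in (0, 2):
        counts = [dir_buckets(level, dir_level) for level in range(4)]
        print(f"dir_level={dir_level}: {counts}")
```

With dir_level = 2, level #0 already has four buckets, which matches the
second diagram, so a lookup in a large directory starts in a wider table.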

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-27 19:56:09 +09:00
Jaegeuk Kim 5d0c667121 f2fs: remove costly bit operations for f2fs_find_entry
It turns out that a bit operation like find_next_bit is not always fast enough
for f2fs_find_entry.
Instead, it is simpler and faster to traverse each dentry.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-27 16:25:20 +09:00
Jaegeuk Kim 8b8343fa9d f2fs: implement a lock-free stat_show
stat_show just shows the current status of f2fs.
So, we can remove all the locks in it.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24 16:00:41 +09:00
Jaegeuk Kim 8a7ed66aaf f2fs: introduce a radix_tree for the free_nid list
This patch introduces a radix tree for the list of free_nids, which enhances
the performance on free nid management.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24 16:00:41 +09:00
Gu Zheng f978f5a061 f2fs: introduce help macro on_build_free_nids()
Introduce the helper macro on_build_free_nids(), which just uses build_lock
to judge whether free nids are being built, so that we can remove
the on_build_free_nids field from f2fs_sb_info.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: remove an unnecessary white line removal]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24 16:00:40 +09:00
Jaegeuk Kim fffc2a00fc f2fs: fix to mark the checkpointed nat entry correctly
The nat cache entry maintains a status whether it is checkpointed or not.
So, if a new cache entry is loaded from the last checkpoint,
nat_entry->checkpointed should be true.
If the cache entry is modified as being dirty, nat_entry->checkpointed should
be false.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24 16:00:40 +09:00
Jaegeuk Kim 6437d1b0ad f2fs: fix to do build_stat prior to the recovery procedure
At the end of the recovery procedure, write_checkpoint is called and updates
the cp count which is managed by f2fs stat.
But, previously build_stat() was called after the recovery procedure, which
resulted in:

BUG: unable to handle kernel NULL pointer dereference at 000000000000012c
IP: [<ffffffffa03b1030>] write_checkpoint+0x720/0xbc0 [f2fs]
Call Trace:
 [<ffffffff810a6b44>] ? mark_held_locks+0x74/0x140
 [<ffffffff8109a3e0>] ? __init_waitqueue_head+0x60/0x60
 [<ffffffffa03bf036>] recover_fsync_data+0x656/0xf20 [f2fs]
 [<ffffffff812ee3eb>] ? security_d_instantiate+0x1b/0x30
 [<ffffffffa03aeb4d>] f2fs_fill_super+0x94d/0xa00 [f2fs]
 [<ffffffff811a9825>] mount_bdev+0x1a5/0x1f0
 [<ffffffff8114915e>] ? __get_free_pages+0xe/0x40
 [<ffffffffa03ae200>] ? f2fs_remount+0x130/0x130 [f2fs]
 [<ffffffffa03aa575>] f2fs_mount+0x15/0x20 [f2fs]
 [<ffffffff811aa713>] mount_fs+0x43/0x1b0
 [<ffffffff811c7124>] vfs_kern_mount+0x74/0x160
 [<ffffffff811c5cb1>] ? __get_fs_type+0x51/0x60
 [<ffffffff811c9727>] do_mount+0x237/0xb50
 [<ffffffff811c936a>] ? copy_mount_options+0x3a/0x170

So, this patch changes the order of recover_fsync_data() and
f2fs_build_stats().

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24 16:00:39 +09:00
Jaegeuk Kim 8618b881e9 f2fs: fix not to write data pages on the page reclaiming path
Even if f2fs_write_data_page is called by the page reclaiming path, we should
not write the page, so that enough free segments remain for the worst case
scenario.
Otherwise, f2fs can run out of free segments while gc is conducted, resulting
in:

 ------------[ cut here ]------------
 kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/segment.c:565!
 RIP: 0010:[<ffffffffa02c3b11>]  [<ffffffffa02c3b11>] new_curseg+0x331/0x340 [f2fs]
 Call Trace:
  allocate_segment_by_default+0x204/0x280 [f2fs]
  allocate_data_block+0x108/0x210 [f2fs]
  write_data_page+0x8a/0xc0 [f2fs]
  do_write_data_page+0xe1/0x2a0 [f2fs]
  move_data_page+0x8a/0xf0 [f2fs]
  f2fs_gc+0x446/0x970 [f2fs]
  f2fs_balance_fs+0xb6/0xd0 [f2fs]
  f2fs_write_begin+0x50/0x350 [f2fs]
  ? unlock_page+0x27/0x30
  ? unlock_page+0x27/0x30
  generic_file_buffered_write+0x10a/0x280
  ? file_update_time+0xa3/0xf0
  __generic_file_aio_write+0x1c8/0x3d0
  ? generic_file_aio_write+0x52/0xb0
  ? generic_file_aio_write+0x52/0xb0
  generic_file_aio_write+0x65/0xb0
  do_sync_write+0x5a/0x90
  vfs_write+0xc5/0x1f0
  SyS_write+0x55/0xa0
  system_call_fastpath+0x16/0x1b

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-24 16:00:33 +09:00
Jaegeuk Kim b63da15e8b f2fs: fix the calculation of max_nids
Total nids that f2fs can use should not include 0, nid for node inode, and nid
for meta inode.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:53 +09:00
Changman Lee 942e0be621 f2fs: show counts of checkpoint in status
This patch shows the counts of checkpoint in f2fs' status.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:53 +09:00
Chao Yu 662befda25 f2fs: introduce ra_meta_pages to readahead CP/NAT/SIT pages
This patch helps us clean up the readahead code by merging the ra_{sit,nat}_pages
functions into ra_meta_pages.
Additionally, the new function is used to read ahead the cp block in
recover_orphan_inodes.

Change log from v1:
 o fix a deadloop bug pointed by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:53 +09:00
Chao Yu 3375f696bd f2fs: use inode mutex to keep atomicity of f2fs_falloc
Previously, without the protection of the inode mutex, f2fs_falloc and other
data-related operations could interfere with each other.
So let's use the inode mutex to keep f2fs_falloc atomic.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:53 +09:00
Jaegeuk Kim 1fe54f9dd3 f2fs: clean up redundant function call
This patch integrates inode_[inc|dec]_dirty_dents with inc_page_count to remove
redundant calls.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:53 +09:00
Jaegeuk Kim 203681f65b f2fs: fix f2fs_write_meta_page at no checkpoint status
If f2fs entered an erroneous checkpoint status, it should skip writing meta
pages instead of redirtying the pages.
Otherwise, the partition cannot be unmounted even though f2fs is in read-only
status.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:53 +09:00
Jaegeuk Kim bd859c6598 f2fs: fix to truncate dentry pages in the error case
When a new directory is allocated, if an error occurs, we should truncate
the preallocated dentry pages too.

This bug was reported by Andrey Tsyvarev after a while as follows.

mkdir()->
 f2fs_add_link()->
  init_inode_metadata()->
    f2fs_init_acl()->
      f2fs_get_acl()->
        f2fs_getxattr()->
          read_all_xattrs() fails.

Also there was a BUG_ON triggered after the fault in
mkdir()->
 f2fs_add_link()->
   init_inode_metadata()->
    remove_inode_page() ->
      f2fs_bug_on(inode->i_blocks != 0 && inode->i_blocks != 1);

But, previous patch wasn't perfect to resolve that bug, so the following bug
report was also submitted.

kernel BUG at fs/f2fs/inode.c:274!
Call Trace:
 [<ffffffff811fde03>] evict+0xa3/0x1a0
 [<ffffffff811fe615>] iput+0xf5/0x180
 [<ffffffffa01c7f63>] f2fs_mkdir+0xf3/0x150 [f2fs]
 [<ffffffff811f2a77>] vfs_mkdir+0xb7/0x160
 [<ffffffff811f36bf>] SyS_mkdir+0x5f/0xc0
 [<ffffffff81680769>] system_call_fastpath+0x16/0x1b

Finally, this patch resolves all the issues like below.

If an error occurs after make_empty_dir(),
 1. truncate_inode_pages()
   The make_bad_inode() prior to iput() will change i_mode to S_IFREG, which
   means that f2fs will not decrement fi->dirty_dents during f2fs_evict_inode.
   But, by calling it here, we can do that.

 2. truncate_blocks()
   Preallocated dentry pages are truncated here to sync i_blocks.

 3. remove_dirty_dir_inode()
   Remove this directory inode from the list.

Reported-and-Tested-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim f6517cfc84 f2fs: fix a build warning
This patch modifies flow a little bit to avoid the following build warnings.

src/fs/f2fs/recovery.c: In function ‘check_index_in_prev_nodes’:
src/fs/f2fs/recovery.c:288:51: warning: ‘sum.<U5390>.<U52f8>.ofs_in_node’ may
	be used uninitialized in this function [-Wmaybe-uninitialized]
src/fs/f2fs/recovery.c:260:23: warning: ‘sum.nid’ may be used uninitialized
	in this function [-Wmaybe-uninitialized]

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim 491c0854b4 f2fs: clean up with a macro
This patch adds GET_BLKOFF_FROM_SEG0 to clean up some codes.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim 924a2ddbd0 f2fs: fix the potential mismatch between dir's i_size and i_blocks
This is the erroneous scenario.

                             i_size    on-disk i_size    i_blocks
__f2fs_add_link()             4096           4096           2
 get_new_data_page            8192           4096           3
 -ENOSPC = init_inode_metadata
 checkpoint                     -            4096           3
 POR and reboot

__f2fs_add_link()             4096           4096           3
 page = get_new_data_page (page->index = 1 by NEW_ADDR)
 add a dentry to the page successfully

f2fs_rmdir()
 f2fs_empty_dir()             4096           4096           3
 f2fs_unlink() goes, since there is no valid dentry due to i_size = 4096.
 But, still there is one dentry in page->index = 1.

So this patch moves the code to write dir->i_size into on-disk i_size in order
to sync dir's i_size, on-disk i_size, and its i_blocks.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim 1b1f559fc3 f2fs: remove the ugly pointer conversion
This patch modifies the use of bi_private to remove pointer chasing for sbi.
Previously, we had a bi_private structure, but it needed a memory allocation.
So this patch stores the sbi pointer in bi_private and adds a completion pointer
to the sbi.
This avoids the memory allocation and makes clean use of bi_private.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim abb2366c82 f2fs: fix to recover xattr node block
If a new xattr node page was allocated and its inode is fsynced, we should
recover the xattr node page during the roll-forward process after power-cut.
But, previously, f2fs didn't handle that case, resulting in kernel panic as
follows reported by Tom Li.

BUG: unable to handle kernel paging request at ffffc9001c861a98
IP: [<ffffffffa0295236>] check_index_in_prev_nodes+0x86/0x2d0 [f2fs]
Call Trace:
 [<ffffffff815ece9b>] ? printk+0x48/0x4a
 [<ffffffffa029626a>] recover_fsync_data+0xdca/0xf50 [f2fs]
 [<ffffffffa02873ae>] f2fs_fill_super+0x92e/0x970 [f2fs]
 [<ffffffff8112c9f8>] mount_bdev+0x1b8/0x200
 [<ffffffffa0286a80>] ? f2fs_remount+0x130/0x130 [f2fs]
 [<ffffffffa0285e40>] f2fs_mount+0x10/0x20 [f2fs]
 [<ffffffff8112d4de>] mount_fs+0x3e/0x1b0
 [<ffffffff810ef4eb>] ? __alloc_percpu+0xb/0x10
 [<ffffffff8114761f>] vfs_kern_mount+0x6f/0x120
 [<ffffffff811497b9>] do_mount+0x259/0xa90
 [<ffffffff810ead1d>] ? memdup_user+0x3d/0x80
 [<ffffffff810eadb3>] ? strndup_user+0x53/0x70
 [<ffffffff8114a2c9>] SyS_mount+0x89/0xd0
 [<ffffffff815feae2>] system_call_fastpath+0x16/0x1b

This patch adds a recovery function of xattr node pages.

Reported-by: Tom Li <biergaizi@members.fsf.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim 5e443818fa f2fs: handle dirty segments inside refresh_sit_entry
This patch cleans up the refresh_sit_entry to handle locate_dirty_segments.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:52 +09:00
Jaegeuk Kim 744602cf45 f2fs: update_inode_page should be done all the time
To keep the fs consistent, update_inode_page should never fail.
Otherwise, it is possible to lose some metadata in the inode, like
a link count.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-02-17 14:58:51 +09:00
Linus Torvalds f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_vec series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds bf3d846b78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted stuff; the biggest pile here is Christoph's ACL series.  Plus
  assorted cleanups and fixes all over the place...

  There will be another pile later this week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
  __dentry_path() fixes
  vfs: Remove second variable named error in __dentry_path
  vfs: Is mounted should be testing mnt_ns for NULL or error.
  Fix race when checking i_size on direct i/o read
  hfsplus: remove can_set_xattr
  nfsd: use get_acl and ->set_acl
  fs: remove generic_acl
  nfs: use generic posix ACL infrastructure for v3 Posix ACLs
  gfs2: use generic posix ACL infrastructure
  jfs: use generic posix ACL infrastructure
  xfs: use generic posix ACL infrastructure
  reiserfs: use generic posix ACL infrastructure
  ocfs2: use generic posix ACL infrastructure
  jffs2: use generic posix ACL infrastructure
  hfsplus: use generic posix ACL infrastructure
  f2fs: use generic posix ACL infrastructure
  ext2/3/4: use generic posix ACL infrastructure
  btrfs: use generic posix ACL infrastructure
  fs: make posix_acl_create more useful
  fs: make posix_acl_chmod more useful
  ...
2014-01-28 08:38:04 -08:00
Christoph Hellwig a6dda0e63e f2fs: use generic posix ACL infrastructure
f2fs has some weird mode bit handling, so still using the old
chmod code for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:19 -05:00
Christoph Hellwig 37bc15392a fs: make posix_acl_create more useful
Rename the current posix_acl_create to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig 5bf3258fd2 fs: make posix_acl_chmod more useful
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Jaegeuk Kim bf39c00a9a f2fs: drop obsolete node page when it is truncated
If a node page is truncated, we'd better drop the page in the node_inode's page
cache to reduce the memory footprint.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-23 08:04:21 +09:00
Jaegeuk Kim 4ef51a8fcc f2fs: introduce NODE_MAPPING for code consistency
This patch adds NODE_MAPPING, which is similar to the META_MAPPING introduced by
Gu Zheng.

Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-22 18:41:08 +09:00
Gu Zheng 63f5384c9a f2fs: remove the orphan block page array
As orphan_blocks may be as large as 504, it is neither safe nor
rigorous to store such a large array on the kernel stack,
as Dan Carpenter said.
In fact, grab_meta_page has locked the page in the page cache,
and we can use find_get_page() to fetch the page safely
downstream, so we can remove the page array directly.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-22 18:41:08 +09:00
Gu Zheng 9df27d982d f2fs: add help function META_MAPPING
Introduce the helper function META_MAPPING() to get the address space of the
cached meta blocks.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-22 18:41:07 +09:00
Jaegeuk Kim e8dae60458 f2fs: move a branch for code redability
This patch moves a function in f2fs_delete_entry for code readability.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-22 18:41:07 +09:00
Jaegeuk Kim a18ff06340 f2fs: call mark_inode_dirty to flush dirty pages
If a dentry page is updated, we should call mark_inode_dirty to add the inode
into the dirty list, so that its dentry pages are flushed to the disk.
Otherwise, the inode can be evicted without flush.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-22 18:40:34 +09:00
Chris Fries 6c311ec6c2 f2fs: clean checkpatch warnings
Fixed a variety of trivial checkpatch warnings.  The only delta should
be some minor formatting on log strings that were split / too long.

Signed-off-by: Chris Fries <cfries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-20 10:27:12 +09:00
Changman Lee c434cbc0ed f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH)
When doing sync_meta_pages with META_FLUSH during checkpoint, we override rw
with WRITE_FLUSH_FUA. At this time, we should also set
REQ_META|REQ_PRIO.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-16 17:28:35 +09:00
Jaegeuk Kim c33ec32692 f2fs: avoid f2fs_balance_fs call during pageout
This patch should resolve the following bug.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.13.0-rc5.f2fs+ #6 Not tainted
---------------------------------------------------------
kswapd0/41 just changed the state of lock:
 (&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs]
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
 (&sbi->cp_rwsem){++++.?}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
Chain exists of:
  &sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sbi->cp_rwsem);
                               local_irq_disable();
                               lock(&sbi->gc_mutex);
                               lock(&sbi->cp_mutex);
  <Interrupt>
    lock(&sbi->gc_mutex);

 *** DEADLOCK ***

This bug is due to the f2fs_balance_fs call in f2fs_write_data_page.
If f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it should
not call f2fs_balance_fs which tries to get a mutex grabbed by original syscall
flow.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-16 16:20:40 +09:00
Changman Lee 499046ab2c f2fs: add delimiter to seperate name and value in debug phrase
Support for f2fs-tools/tools/f2stat to monitor
/sys/kernel/debug/f2fs/status

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14 18:22:17 +09:00
Gu Zheng 17b692f60e f2fs: use spinlock rather than mutex for better speed
With the 2 previous changes, all the long time operations are moved out
of the protection region, so here we can use spinlock rather than mutex
(orphan_inode_mutex) for lower overhead.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14 18:12:05 +09:00
Gu Zheng c1ef372572 f2fs: move alloc new orphan node out of lock protection region
Move alloc new orphan node out of lock protection region.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14 18:12:04 +09:00
Gu Zheng 4531929e39 f2fs: move grabing orphan pages out of protection region
Move grabbing the orphan block pages out of the protection region, and grab all
the orphan block pages ahead of time.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove unnecessary code pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14 18:11:20 +09:00
Yuan Zhong 5514f0aadd f2fs: remove the needless parameter of f2fs_wait_on_page_writeback
"bool sync" parameter is never referenced in f2fs_wait_on_page_writeback.
We should remove this parameter.

Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14 17:45:54 +09:00
Jaegeuk Kim b1c57c1caa f2fs: add a sysfs entry to control max_victim_search
Previously during SSR and GC, the maximum number of retrials to find a victim
segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default.

This number affects IO locality when SSR mode is activated, which
results in performance fluctuation on some low-end devices.

If max_victim_search = 4, the victim will be searched like below.
("D" represents a dirty segment, and "*" indicates a selected victim segment.)

 D1 D2 D3 D4 D5 D6 D7 D8 D9
[   *       ]
      [   *    ]
            [         * ]
	                [ ....]

This patch adds a sysfs entry to control the number dynamically through:
  /sys/fs/f2fs/$dev/max_victim_search
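
The bounded search pictured above can be sketched roughly as follows. This
is only an illustration, not the f2fs implementation: pick_victim(), the
cost[] array and the wrap-around behaviour are all assumptions made for the
example. It examines at most max_search candidates from a starting position
and keeps the cheapest one.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not an f2fs function): examine at most max_search
 * dirty segments, starting at index start and wrapping around, and return
 * the index of the cheapest one. cost[] is an assumed per-segment cleaning
 * cost; n must be greater than 0. At least one segment is always examined.
 */
size_t pick_victim(const unsigned int *cost, size_t n,
                   size_t start, size_t max_search)
{
    size_t best = start % n;
    size_t limit = max_search < n ? max_search : n;

    for (size_t k = 1; k < limit; k++) {
        size_t idx = (start + k) % n;

        if (cost[idx] < cost[best])
            best = idx;
    }
    return best;
}
```

A smaller max_search keeps each pick close to the starting position, at the
price of possibly missing a cheaper segment further away.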

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-08 13:45:08 +09:00
Jaegeuk Kim fb5566da91 f2fs: improve write performance under frequent fsync calls
When considering a bunch of data writes with very frequent fsync calls, we
can think of the following performance regression.

N: Node IO, D: Data IO, IO scheduler: cfq

Issue    pending IOs
	 D1 D2 D3 D4
 D1         D2 D3 D4 N1
 D2            D3 D4 N1 N2
 N1            D3 D4 N2 D1
 --> N1 can be selected by cfq because of the same priority of N and D.
     Then D3 and D4 would be delayed, resulting in performance degradation.

So, when processing the fsync call, it is better to give higher priority to data IOs
than node IOs by assigning WRITE and WRITE_SYNC respectively.
This patch improves the random write performance with frequent fsync calls by up
to 10%.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-08 11:16:20 +09:00
Chao Yu 04a17fb17f f2fs: avoid to read inline data except first page
Here is a case which could read inline page data not from first page.

1. write inline data
2. lseek to offset 4096
3. read 4096 bytes from offset 4096
	(read_inline_data reads the inline data page into a non-first page,
	and VFS has previously added this page to the page cache)
4. ftruncate offset 8192
5. read 4096 bytes from offset 4096
	(we meet this updated page with inline data in cache)

So we should leave this page with inited data and uptodate flag
for this case.

Change log from v1:
 o fix a deadlock bug

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:22 +09:00
Chao Yu 18309aaa41 f2fs: avoid to left uninitialized data in page when read inline data
Change log from v1:
 o reduce unneeded memset in __f2fs_convert_inline_data

From 58796be2bd2becbe8d52305210fb2a64e7dd80b6 Mon Sep 17 00:00:00 2001
From: Chao Yu <chao2.yu@samsung.com>
Date: Mon, 30 Dec 2013 09:21:33 +0800
Subject: [PATCH] f2fs: avoid to left uninitialized data in page when read
 inline data

We left uninitialized data in the tail of the page when reading an inline data
page. So let's initialize the rest of the page, excluding the inline data region.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:22 +09:00
shifei10.ge a225dca394 f2fs: fix truncate_partial_nodes bug
The truncate_partial_nodes puts pages incorrectly in the following two cases.
Note that the value for the argument 'depth' can only be 2 or 3.
Please see truncate_inode_blocks() and truncate_partial_nodes().

1) An error occurs in the first 'for' loop
  When the error occurs with depth = 2, pages[0] is invalid, so this page doesn't
  need to be put. There is no problem. However, when depth is 3, it doesn't put
  the pages correctly where pages[0] is valid and pages[1] is invalid.
  In this case, depth is set to 2 (see the statement depth = i + 1), and then
  'goto fail'.
  In label 'fail', for (i = depth - 3; i >= 0; i--) cannot meet the condition
  because i = -1, so pages[0] can't be put.

2) An error occurs in the second 'for' loop
  Now we've got pages[0] with depth = 2, or we've got pages[0] and pages[1]
  with depth = 3. When an error is detected, we need 'goto fail' to put these
  pages.
  When depth is 2, in label 'fail', for (i = depth - 3; i >= 0; i--) can't
  meet the condition because i = -1, so pages[0] can't be put.
  When depth is 3, in label 'fail', for (i = depth - 3; i >= 0; i--) can
  only put pages[0]; pages[1] can't be put either.

Note that 'depth' has been changed before the first 'goto fail' (see the statement
depth = i + 1), so passing this modified 'depth' to the tracepoint,
trace_f2fs_truncate_partial_nodes, is also incorrect.
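
To make the off-by-some arithmetic concrete, the sketch below counts how
many pages[] entries the quoted cleanup loop would release for a given
depth. Only the loop header is taken from the description above; the
counting wrapper pages_put_by_old_loop() is a hypothetical stand-in and does
not exist in f2fs.

```c
/*
 * Count how many entries of pages[] the cleanup loop quoted in the commit
 * message would release for a given depth. The increment stands in for
 * f2fs_put_page(pages[i], ...); the wrapper itself is illustrative only.
 */
int pages_put_by_old_loop(int depth)
{
    int put = 0;

    for (int i = depth - 3; i >= 0; i--)
        put++;
    return put;
}
```

With depth = 2 the loop body never runs, and with depth = 3 it runs once
(for pages[0] only), which matches the two leak cases listed above.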

Signed-off-by: Shifei Ge <shifei10.ge@samsung.com>
[Jaegeuk Kim: modify the description and fix one bug]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:21 +09:00
Jaegeuk Kim a8865372a8 f2fs: handle errors correctly during f2fs_reserve_block
The get_dnode_of_data nullifies inode and node page when an error occurs.

There are two cases that pass an inode page into get_dnode_of_data().

1. make_empty_dir()
    -> get_new_data_page()
      -> f2fs_reserve_block(ipage)
	-> get_dnode_of_data()

2. f2fs_convert_inline_data()
    -> __f2fs_convert_inline_data()
      -> f2fs_reserve_block(ipage)
	-> get_dnode_of_data()

This patch adds correct error handling code for when get_dnode_of_data() returns
an error.

At first, f2fs_reserve_block() calls f2fs_put_dnode() whenever reserve_new_block
returns an error.
So, the rule of f2fs_reserve_block() is to nullify inode page when there is any
error internally.

Finally, the two callers of f2fs_reserve_block() should call f2fs_put_dnode()
appropriately if they hit an error after a successful f2fs_reserve_block().

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:21 +09:00
Jaegeuk Kim 1e1bb4baf1 f2fs: add inline_data recovery routine
This patch adds an inline_data recovery routine with the following policy.

[prev.] [next] of inline_data flag
   o       o  -> recover inline_data
   o       x  -> remove inline_data, and then recover data blocks
   x       o  -> remove inline_data, and then recover inline_data
   x       x  -> recover data blocks
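
The four-row policy above maps directly onto a two-flag decision. The
following is a minimal sketch of that mapping only; the enum, its member
names and choose_recovery_action() are hypothetical and are not taken from
the f2fs recovery code.

```c
#include <stdbool.h>

/* Hypothetical names, one per row of the policy table. */
enum recovery_action {
    RECOVER_INLINE_DATA,              /* prev o, next o */
    REMOVE_INLINE_THEN_RECOVER_BLOCKS,/* prev o, next x */
    REMOVE_INLINE_THEN_RECOVER_INLINE,/* prev x, next o */
    RECOVER_DATA_BLOCKS               /* prev x, next x */
};

/* prev_inline / next_inline: whether the inline_data flag is set. */
enum recovery_action choose_recovery_action(bool prev_inline, bool next_inline)
{
    if (prev_inline)
        return next_inline ? RECOVER_INLINE_DATA
                           : REMOVE_INLINE_THEN_RECOVER_BLOCKS;
    return next_inline ? REMOVE_INLINE_THEN_RECOVER_INLINE
                       : RECOVER_DATA_BLOCKS;
}
```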

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:20 +09:00
Jaegeuk Kim 0dbdc2ae9b f2fs: add the number of inline_data files to status info
This patch adds the number of inline_data files into the status information.
Note that the number is reset whenever the filesystem is newly mounted.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:20 +09:00
Jaegeuk Kim 9e09fc855d f2fs: refactor f2fs_convert_inline_data
Change log from v1:
 o handle NULL pointer of grab_cache_page_write_begin() pointed by Chao Yu.

This patch refactors f2fs_convert_inline_data to check a couple of conditions
internally for deciding whether it needs to convert inline_data or not.

So, the new f2fs_convert_inline_data initially checks:
1) f2fs_has_inline_data(), and
2) the data size to be changed.

If the inode has inline_data but the size to fill is less than MAX_INLINE_DATA,
then we don't need to convert the inline_data with data allocation.
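
As a rough sketch of the two checks described above, assuming the limit is
passed in rather than read from the MAX_INLINE_DATA macro:
needs_inline_conversion() is a hypothetical name, and treating a size equal
to the limit as still fitting inline is an assumption, since the text only
covers the "less than" case.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical predicate: conversion is needed only when the inode
 * currently holds inline data and the new size no longer fits.
 * max_inline stands in for MAX_INLINE_DATA.
 */
bool needs_inline_conversion(bool has_inline_data, size_t new_size,
                             size_t max_inline)
{
    if (!has_inline_data)
        return false;
    return new_size > max_inline;
}
```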

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:19 +09:00
Jaegeuk Kim 26f466f4a9 f2fs: call f2fs_put_page at the error case
In f2fs_write_begin(), if f2fs_convert_inline_data() returns an error like
-ENOSPC, f2fs should call f2fs_put_page().
Otherwise, the page remains locked, resulting in the following bug.

[<ffffffff8114657e>] sleep_on_page+0xe/0x20
[<ffffffff81146567>] __lock_page+0x67/0x70
[<ffffffff81157d08>] truncate_inode_pages_range+0x368/0x5d0
[<ffffffff81157ff5>] truncate_inode_pages+0x15/0x20
[<ffffffff8115804b>] truncate_pagecache+0x4b/0x70
[<ffffffff81158082>] truncate_setsize+0x12/0x20
[<ffffffffa02a1842>] f2fs_setattr+0x72/0x270 [f2fs]
[<ffffffff811cdae3>] notify_change+0x213/0x400
[<ffffffff811ab376>] do_truncate+0x66/0xa0
[<ffffffff811ab541>] vfs_truncate+0x191/0x1b0
[<ffffffff811ab5bc>] do_sys_truncate+0x5c/0xa0
[<ffffffff811ab78e>] SyS_truncate+0xe/0x10
[<ffffffff81756052>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:19 +09:00
Jaegeuk Kim 8230a0a49f f2fs: convert inline_data for punch_hole
In the punch_hole(), let's convert inline_data all the time for simplicity and
to avoid potential deadlock conditions.
It is pretty much not a big deal to do this.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-06 16:42:12 +09:00
Jaegeuk Kim f185ff979f f2fs: don't need to get f2fs_lock_op for the inline_data test
This patch locates checking the inline_data prior to calling f2fs_lock_op()
in truncate_blocks(), since getting the lock is unnecessary.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-27 12:40:41 +09:00
Huajun Li 9ffe0fb5f3 f2fs: handle inline data operations
Hook inline data read/write, truncate, fallocate, setattr, etc.

Files need to meet the following 2 requirements to be inlined:
 1) file size is not greater than MAX_INLINE_DATA;
 2) file doesn't pre-allocate data blocks by fallocate().

FI_INLINE_DATA will not be set while creating a new regular inode because
most of the files are bigger than ~3.4K. Set FI_INLINE_DATA only when
data is submitted to the block layer, rather than setting it while creating a new
inode; this also avoids converting data from inline to normal data block
and vice versa.

While writing inline data to the inode block, the first data block should be
released if the file has a block indexed by i_addr[0].

On the other hand, when a file operation is applied to a file with inline
data, we need to test if this file can remain inline after this
operation; otherwise it should be converted into a normal file by reserving
a new data block, copying the inline data to this new block and clearing
the FI_INLINE_DATA flag. Because reserving a new data block here will make use
of i_addr[0], if we saved inline data in i_addr[0..872], then the first
4 bytes would be overwritten. This problem can be avoided simply by
not using i_addr[0] for inline data.
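
The "~3.4K" figure can be reproduced from the numbers in this message,
assuming 4-byte address slots (the "first 4 bytes" above) and the slot range
i_addr[0..872] with i_addr[0] left unused. The macro and function names in
this sketch are made up for the example and are not the kernel's
definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions taken from the message: slots i_addr[0..872], 4 bytes each. */
#define SKETCH_ADDR_SLOTS     873u
#define SKETCH_RESERVED_SLOTS 1u   /* i_addr[0] stays free for conversion */

/* Bytes available for inline data under the assumptions above. */
size_t inline_capacity_bytes(void)
{
    return (SKETCH_ADDR_SLOTS - SKETCH_RESERVED_SLOTS) * sizeof(uint32_t);
}

/* Requirement 1 above: file size is not greater than the inline limit. */
bool fits_inline(size_t file_size)
{
    return file_size <= inline_capacity_bytes();
}
```

Under these assumptions the capacity works out to 872 * 4 = 3488 bytes,
which is roughly 3.4K.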

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:40:41 +09:00
Huajun Li e18c65b2ac f2fs: key functions to handle inline data
Functions to implement inline data read/write, and move inline data to
normal data block when file size exceeds inline data limitation.

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:40:09 +09:00
Gu Zheng 0d47c1adc2 f2fs: convert max_orphans to a field of f2fs_sb_info
Previously, we needed to calculate the max orphan num when we tried to acquire an
orphan inode, but it's a stable value once the super block is inited. So
converting it to a field of f2fs_sb_info and using it directly when needed seems
a better choice.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:37:52 +09:00
Jaegeuk Kim 944fcfc184 f2fs: check the blocksize before calling generic_direct_IO path
The f2fs supports 4KB block size. If a user requests a direct write of less than 4KB of data,
it allocates a new 4KB data block.
However, f2fs doesn't add zero data into the untouched data area inside the
newly allocated data block.

This incurs an error during the xfstest #263 test as follows.

263 12s ... [failed, exit status 1] - output mismatch (see 263.out.bad)
	--- 263.out	2013-03-09 03:37:15.043967603 +0900
	+++ 263.out.bad	2013-12-27 04:20:39.230203114 +0900
	@@ -1,3 +1,976 @@
	QA output created by 263
	fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
	-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
	+fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
	+truncating to largest ever: 0x12a00
	+truncating to largest ever: 0x75400
	+fallocating to largest ever: 0x79cbf
	...
	(Run 'diff -u 263.out 263.out.bad' to see the entire diff)
	Ran: 263
	Failures: 263
	Failed 1 of 1 tests

It turns out that, when the test tries to write 2KB data with dio, the new dio
path allocates a 4KB data block without filling zero data inside the remaining 2KB
area. Finally, the output file contains garbage data for that region.
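
A block-size check of the kind the title refers to might look like the
sketch below. It is an assumption about the shape of such a check, not the
code from this patch: dio_request_is_block_aligned() and SKETCH_BLKSIZE are
hypothetical names, and a caller would presumably fall back to the buffered
path when the check fails.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLKSIZE 4096u   /* the 4KB block size mentioned above */

/*
 * Hypothetical check: a request qualifies for the direct IO path only if
 * both its file offset and its length are multiples of the block size.
 */
bool dio_request_is_block_aligned(uint64_t offset, uint64_t len)
{
    const uint64_t mask = SKETCH_BLKSIZE - 1;

    return ((offset | len) & mask) == 0;
}
```

A 2KB write like the one in the failing test would be rejected by this
check, so the partially filled block never reaches the dio path.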

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:33:07 +09:00
Jaegeuk Kim 1ec79083b2 f2fs: should put the dnode when NEW_ADDR is detected
When get_dnode_of_data() in get_data_block() returns a successful dnode, we
should put the dnode.
But, previously, if its data block address is equal to NEW_ADDR, we didn't do
that, resulting in a deadlock condition.
So, this patch splits original error conditions with this case, and then calls
f2fs_put_dnode before finishing the function.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:33:06 +09:00
Jaegeuk Kim 58bfaf44df f2fs: introduce F2FS_INODE macro to get f2fs_inode
This patch introduces F2FS_INODE that returns struct f2fs_inode * from the inode
page.
By using this macro, we can remove unnecessary casting codes like below.

   struct f2fs_inode *ri = &F2FS_NODE(inode_page)->i;
-> struct f2fs_inode *ri = F2FS_INODE(inode_page);

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:32:48 +09:00
Chao Yu d96b143151 f2fs: check filename length in recover_dentry
In the current flow, we will get a NULL return value from f2fs_find_entry in
recover_dentry when name.len is bigger than F2FS_NAME_LEN, and then we
still add this inode into its dir entry.
To avoid this situation, we must check the filename length before we use it.

Another point is that we could remove the code checking the filename length
in f2fs_find_entry, because f2fs_lookup will be called beforehand to ensure the
validity of the filename length.

V2:
 o add WARN_ON() as Jaegeuk Kim suggested.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 12:50:09 +09:00
Chao Yu deead09009 f2fs: avoid to set wrong pino of inode when rename dir
When we rename a dir to a new name which did not exist previously,
we set the pino of the parent inode to the ino of the child inode in f2fs_set_link.
This destroys the consistency of pino and should be fixed.

Thanks for previous work of Shu Tan.

Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:42:51 +09:00
Chao Yu 4f4124d0b9 f2fs: update several comments
Update several comments:
1. use f2fs_{un}lock_op instead of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:26:03 +09:00
Gu Zheng 7e8f23081a f2fs: remove the rw_flag domain from f2fs_io_info
When using the f2fs_io_info in the low level, we still need to merge the
rw and rw_flag, so use the rw to hold all the io flags directly,
and remove the rw_flag field.

P.S. It is based on the previous patch:
f2fs: move all the bio initialization into __bio_alloc

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Gu Zheng 940a6d34b3 f2fs: move all the bio initialization into __bio_alloc
Move all the bio initialization into __bio_alloc, and some minor cleanups are
also added.

v3:
  Use 'bool' rather than 'int' as Kim suggested.

v2:
  Use 'is_read' rather than 'rw' as Yu Chao suggested.
  Remove the needless initialization of bio->bi_private.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim 5459aa9770 f2fs: write dirty meta pages collectively
This patch enhances writing dirty meta pages collectively in background.
During the file data writes, it is better to avoid writing small dirty meta pages
frequently.
So let's give a chance to collect a number of dirty meta pages for a while.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim bfad7c2d40 f2fs: introduce a new direct_IO write path
Previously, f2fs didn't support direct IOs with high performance; it threw
every write request through the buffered write path, resulting in heavy
performance degradation due to memory operations like copy_from_user.

This patch introduces a new direct IO path in which every write request is
processed by generic blockdev_direct_IO() with an enhanced get_block function.

The get_data_block() in f2fs handles:
1. if original data blocks are allocated, then give them to blockdev.
2. otherwise,
  a. preallocate requested block addresses
  b. do not use extent cache for better performance
  c. give the block addresses to blockdev

This policy induces that:
- new allocated data are sequentially written to the disk
- updated data are randomly written to the disk.
- f2fs gives consistency on its file meta, not file data.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim 216fbd6443 f2fs: introduce sysfs entry to control in-place-update policy
This patch introduces new sysfs entries for users to control the policy of
in-place-updates, namely IPU, in f2fs.

Sometimes f2fs suffers from performance degradation due to its out-of-place
update policy that produces many additional node block writes.
If the storage performance is very dependent on the amount of data writes
instead of IO patterns, we'd better drop this out-of-place update policy.

This patch suggests 5 policies and their triggering conditions as follows.

[sysfs entry name = ipu_policy]

0: F2FS_IPU_FORCE       all the time,
1: F2FS_IPU_SSR         if SSR mode is activated,
2: F2FS_IPU_UTIL        if FS utilization is over threshold,
3: F2FS_IPU_SSR_UTIL    if SSR mode is activated and FS utilization is over
                        threshold,
4: F2FS_IPU_DISABLE     disable IPU. (=default option)

[sysfs entry name = min_ipu_util]

This parameter controls the threshold to trigger in-place-updates.
The number indicates percentage of the filesystem utilization, and used by
F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.

For more details, see need_inplace_update() in segment.h.
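
The policy table above can be read as a small decision function. The sketch
below follows that table only; it is not the body of need_inplace_update().
The function name, its parameters and the strict "greater than" comparison
for "over threshold" are assumptions made for the example.

```c
#include <stdbool.h>

/* Policy values as listed for the ipu_policy sysfs entry. */
enum ipu_policy {
    F2FS_IPU_FORCE = 0,
    F2FS_IPU_SSR,
    F2FS_IPU_UTIL,
    F2FS_IPU_SSR_UTIL,
    F2FS_IPU_DISABLE
};

/*
 * Hypothetical decision helper. util_percent is the current filesystem
 * utilization and min_ipu_util the threshold, both in percent.
 */
bool wants_inplace_update(enum ipu_policy policy, bool ssr_active,
                          unsigned int util_percent,
                          unsigned int min_ipu_util)
{
    switch (policy) {
    case F2FS_IPU_FORCE:
        return true;
    case F2FS_IPU_SSR:
        return ssr_active;
    case F2FS_IPU_UTIL:
        return util_percent > min_ipu_util;
    case F2FS_IPU_SSR_UTIL:
        return ssr_active && util_percent > min_ipu_util;
    case F2FS_IPU_DISABLE:
    default:
        return false;   /* IPU disabled, or an unknown policy value */
    }
}
```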

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Changman Lee 5dcd8a7150 f2fs: missing kmem_cache_destroy for discard_entry
insmod f2fs.ko fails after a first insmod and rmmod.

$ sudo insmod fs/f2fs/f2fs.ko
insmod: error inserting 'fs/f2fs/f2fs.ko': -1 Cannot allocate memory

-- dmesg --
kmem_cache_sanity_check (free_nid): Cache name already exists.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim 76130ccabc f2fs: fix the location of tracepoint
We need to get a trace before submit_bio, since its bi_sector is remapped during
the submit_bio.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Jaegeuk Kim 458e6197c3 f2fs: refactor bio->rw handling
This patch introduces f2fs_io_info to mitigate the complex parameter list.

struct f2fs_io_info {
	enum page_type type;		/* contains DATA/NODE/META/META_FLUSH */
	int rw;				/* contains R/RS/W/WS */
	int rw_flag;			/* contains REQ_META/REQ_PRIO */
};

1. f2fs_write_data_pages
 - DATA
 - WRITE_SYNC is set when wbc->WB_SYNC_ALL.

2. sync_node_pages
 - NODE
 - WRITE_SYNC all the time

3. sync_meta_pages
 - META
 - WRITE_SYNC all the time
 - REQ_META | REQ_PRIO all the time

 ** f2fs_submit_merged_bio() handles META_FLUSH.

4. ra_nat_pages, ra_sit_pages, ra_sum_pages
 - META
 - READ_SYNC

Cc: Fan Li <fanofcode.li@samsung.com>
Cc: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Fan Li 63a0b7cb33 f2fs: merge pages with the same sync_mode flag
Previously f2fs submitted most write requests using WRITE_SYNC, but f2fs_write_data_pages
submitted the last write requests with the sync_mode flags that callers pass.

This causes a performance problem since contiguous pages with different sync flags
can't be merged in the cfq IO scheduler (thanks to Yu Chao for pointing it out), and synchronous
requests often take more time.

This patch makes the following modifications to DATA writebacks:

1. every page will be written back using the sync mode the caller passes.
2. only pages with the same sync mode can be merged in one bio request.

These changes are restricted to DATA pages. Other types of writebacks are modified
to remain synchronous.

In my test with tiotest, f2fs sequential write performance is improved by about 7%-10%,
and this patch has no obvious impact on other performance tests.
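
The merge rule in point 2 can be illustrated by counting how many bios a run
of consecutive pages would need when only neighbours with the same sync mode
may share one. This is a simplified model: count_merged_bios() is a
hypothetical helper, and it ignores block contiguity and bio size limits,
which also affect merging in practice.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: is_sync[i] tells whether page i is written with a
 * synchronous mode. A new bio starts whenever the mode differs from the
 * previous page, so the result is the number of same-mode runs.
 */
size_t count_merged_bios(const bool *is_sync, size_t n)
{
    size_t bios = 0;

    for (size_t i = 0; i < n; i++) {
        if (i == 0 || is_sync[i] != is_sync[i - 1])
            bios++;
    }
    return bios;
}
```

Five pages with modes sync, sync, async, async, sync would need three bios
in this model, while five pages that all share one mode need a single bio.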

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Jaegeuk Kim 6bacf52fb5 f2fs: add unlikely() macro for compiler more aggressively
This patch adds the unlikely() macro to most of the code.
The basic rule is to add that when:
- checking unusual errors,
- checking page mappings,
- and the other unlikely conditions.

Change log from v1:
 - Don't add unlikely for the NULL test and error test: advised by Andi Kleen.

Cc: Chao Yu <chao2.yu@samsung.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Chao Yu cfb271d485 f2fs: add unlikely() macro for compiler optimization
As we know, some of our branch conditions will rarely be true. So we could add
'unlikely' to let the compiler optimize this code; this way we could drop
unneeded 'jump' assembly code to improve performance.
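
For readers outside the kernel tree, a userspace sketch of the idea is shown
below. It assumes a GCC- or Clang-compatible compiler, since it relies on
__builtin_expect; the local unlikely() definition and checked_div() are
written for this example and are not copied from f2fs.

```c
/* Branch hint for userspace; requires a GCC- or Clang-compatible compiler. */
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif

/*
 * Example function: the error branch is marked as rarely taken so the
 * compiler can lay out the common path as straight-line code.
 * Returns 0 on success and -1 when den is zero.
 */
int checked_div(int num, int den, int *out)
{
    if (unlikely(den == 0))
        return -1;

    *out = num / den;
    return 0;
}
```

The hint changes only code layout, never the result, so a wrong guess costs
performance but not correctness.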

change log:
 o add *unlikely* as many as possible across the whole source files at once
   suggested by Jaegeuk Kim.

Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Chao Yu b9987a277f f2fs: avoid unneeded page release for correct _count of page
In find_fsync_dnodes() and recover_data(), our flow is like this:

->f2fs_submit_page_bio()
	-> f2fs_put_page()
		-> page_cache_release()	---- page->_count declined to zero.
->__free_pages()
	-> put_page_testzero() ---- page->_count will be declined again.

We will get a segmentation fault in put_page_testzero when CONFIG_DEBUG_VM
is on, or return a bad page with a wrong _count to MM.

So let's just release this page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Chao Yu a0acdfe05a f2fs: use inner macro GFP_F2FS_ZERO for simplification
Use inner macro GFP_F2FS_ZERO instead of GFP_NOFS | __GFP_ZERO for
simplification of code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Younger Liu 40e1ebe97d f2fs: replace the debugfs_root with f2fs_debugfs_root
This is a minor change to the naming of debugfs_root
to avoid any possible conflicts with other filesystems.

Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
[Jaegeuk Kim: change the patch name]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Younger Liu c524723ebf f2fs: remove debufs dir if debugfs_create_file() failed
When debugfs_create_file() fails in f2fs_create_root_stats(),
debugfs_root should be removed.

Signed-off-by: Younger Liu <liuyiyang@hisense.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Chao Yu 9af0ff1c52 f2fs: readahead contiguous pages for restore_node_summary
If cp has no CP_UMOUNT_FLAG, we will read all pages in the whole node segment
one by one, which performs poorly. So let's merge contiguous pages and
readahead for better performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: adjust the new bio operations]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Jaegeuk Kim 93dfe2ac51 f2fs: refactor bio-related operations
This patch integrates redundant bio operations on read and write IOs.

1. Move bio-related codes to the top of data.c.
2. Replace f2fs_submit_bio with f2fs_submit_merged_bio, which handles read
   bios additionally.
3. Introduce __submit_merged_bio to submit the merged bio.
4. Change f2fs_readpage to f2fs_submit_page_bio.
5. Introduce f2fs_submit_page_mbio to integrate previous submit_read_page and
   submit_write_page.

Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Jaegeuk Kim 187b5b8b3d f2fs: remove the own bi_private allocation
Previously f2fs allocates its own bi_private data structure all the time even
though we don't use it. But, can we remove this bi_private allocation?

This patch removes this additional bi_private allocation.

1. Retrieve f2fs_sb_info from its page->mapping->host->i_sb.
 - This removes the usecases of bi_private in end_io.

2. Use bi_private only when we really need it.
 - The bi_private is used only when the checkpoint procedure is conducted.
 - When conducting the checkpoint, f2fs submits a META_FLUSH bio and waits for
 its completion.
 - Since we have no dependencies that prevent removing bi_private now, let's just use
 the bi_private pointer as the completion pointer.

Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Chao Yu 8f99a946f3 f2fs: convert recover_orphan_inodes to void
The recover_orphan_inodes() returns no error all the time, so we don't need to
check its errors.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Chao Yu 1069bbf7b9 f2fs: check return value of f2fs_readpage in find_data_page
We should return an error if we do not get an updated page in find_data_page
when f2fs_readpage fails.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu 01d2d1aa06 f2fs: use true and false for boolean variable
The inode_page_locked should be a boolean variable.

struct dnode_of_data {
	struct inode *inode;            /* vfs inode pointer */
	struct page *inode_page;        /* its inode page, NULL is possible */
	struct page *node_page;         /* cached direct node page */
	nid_t nid;                      /* node id of the direct node block */
	unsigned int ofs_in_node;       /* data offset in the node page */
==>	bool inode_page_locked;         /* inode page is locked or not */
	block_t data_blkaddr;           /* block address of the node block */
};

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu aac44046a2 f2fs: correct type of wait in struct bio_private
The void *wait in bio_private is used for waiting completion of checkpoint bio.
So we don't need to use its type as void, but declare it as completion type.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu 6947eea957 f2fs: avoid to calculate incorrect max orphan number
Because we will write node summaries when do_checkpoint is called with the umount flag,
our max number of orphan blocks should additionally subtract NR_CURSEG_NODE_TYPE.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu a66c7b2fcf f2fs: remove unneeded code in punch_hole
Because the FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE
in fallocate, we could remove the useless 'keep size' branch code which
will never be executed in punch_hole.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
[Jaegeuk Kim: remove an unnecessary parameter together]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Jaegeuk Kim 031fa8cc9b f2fs: remove unnecessary condition checks
This patch removes the unnecessary condition checks on:

fs/f2fs/gc.c:667 do_garbage_collect() warn: 'sum_page' isn't an ERR_PTR
fs/f2fs/f2fs.h:795 f2fs_put_page() warn: 'page' isn't an ERR_PTR

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Jaegeuk Kim f9a4e6df52 f2fs: bug fix on bit overflow from 32bits to 64bits
This patch fixes some bit overflows by the shift operations.

Dan Carpenter reported potential bugs on bit overflows as follows.

fs/f2fs/segment.c:910 submit_write_page()
	warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/checkpoint.c:429 get_valid_checkpoint()
	warn: should '1 << ()' be a 64 bit type?
fs/f2fs/data.c:408 f2fs_readpage()
	warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/data.c:457 submit_read_page()
	warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/data.c:525 get_data_block_ro()
	warn: should 'i << blkbits' be a 64 bit type?
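
As a minimal user-space sketch of the class of bug behind these warnings (not the kernel code itself): when a 32-bit block address is shifted, the shift is evaluated in 32 bits and the high bits are lost before the result is widened. The typedefs and helper names below are stand-ins chosen for illustration.

```c
#include <stdint.h>

/* Assumed stand-ins for the kernel types: 32-bit block_t, 64-bit sector_t. */
typedef uint32_t block_t;
typedef uint64_t sector_t;

/* Problematic form: the shift happens in 32 bits, then the result is widened. */
static sector_t blk_to_sector_truncating(block_t blk_addr, unsigned int log_blocksize)
{
	return blk_addr << (log_blocksize - 9);
}

/* Safer form: widen to 64 bits first, then shift. */
static sector_t blk_to_sector(block_t blk_addr, unsigned int log_blocksize)
{
	return (sector_t)blk_addr << (log_blocksize - 9);
}
```

With a log_blocksize of 12 (4 KiB blocks, an assumed value), any block address of 2^29 or above loses bits in the first form but not in the second.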

Bug-Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Gu Zheng 3679556794 f2fs: fix a potential out of range issue
Fix a potential out of range issue introduced by commit:
22fb72225a
f2fs: simplify write_orphan_inodes for better readable

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Jaegeuk Kim 0e80220ac5 f2fs: remove unnecessary return value
Let's remove the unnecessary return value.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Huajun Li 8274de77b7 f2fs: add a new mount option: inline_data
Add a mount option: inline_data. If the mount option is set,
data of newly created small files can be stored in their inode.

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Huajun Li 1001b3479c f2fs: add flags and helpers to support inline data
Add new inode flags F2FS_INLINE_DATA and FI_INLINE_DATA to indicate
whether the inode has inline data.

Inline data makes use of inode block's data indices region to save small
file. Currently there are 923 data indices in an inode block. Since
inline xattr has made use of the last 50 indices to save its data, there
are 873 indices left which can be used for inline data. When
FI_INLINE_DATA is set, the layout of inode block's indices region is
like below:

+-----------------+
|                 | Reserved. reserve_new_block() will make use of
| i_addr[0]       | i_addr[0] when we need to reserve a new data block
|                 | to convert inline data into regular one's.
|-----------------|
|                 | Used by inline data. A file whose size is less than
| i_addr[1~872]   | 3488 bytes(~3.4k) and doesn't reserve extra
|                 | blocks by fallocate() can be saved here.
|-----------------|
|                 |
| i_addr[873~922] | Reserved for inline xattr
|                 |
+-----------------+
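
The numbers in the layout above can be checked with a small sketch. The constant names are hypothetical (chosen for this illustration, not taken from the kernel headers), and the 4-byte size of one index is an assumption that matches 872 indices giving 3488 bytes.

```c
#include <stdint.h>

enum {
	EX_ADDRS_PER_INODE    = 923, /* data indices in an inode block */
	EX_INLINE_XATTR_ADDRS = 50,  /* last indices, used by inline xattr */
	EX_RESERVED_ADDRS     = 1    /* i_addr[0], kept for conversion */
};

/* Number of indices available to inline data: i_addr[1~872]. */
static unsigned int ex_inline_data_addrs(void)
{
	return EX_ADDRS_PER_INODE - EX_INLINE_XATTR_ADDRS - EX_RESERVED_ADDRS;
}

/* Bytes available to inline data, assuming each index is 4 bytes wide. */
static unsigned int ex_max_inline_data_bytes(void)
{
	return ex_inline_data_addrs() * (unsigned int)sizeof(uint32_t);
}
```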

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Changman Lee 03232305ff f2fs: send REQ_META or REQ_PRIO when reading meta area
Let's send REQ_META or REQ_PRIO when reading meta area such as NAT/SIT
etc.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Jaegeuk Kim a709f4a2f2 f2fs: add detailed information of bio types in the tracepoints
This patch inserts information of bio types in more detail.
So, we can now see REQ_META and REQ_PRIO too.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Huajun Li b600965c43 f2fs: add a new function: f2fs_reserve_block()
Add the function f2fs_reserve_block() to easily reserve new blocks, and
use it to clean up more codes.

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Jaegeuk Kim 0daaad97dc f2fs: avoid lock debugging overhead
If CONFIG_F2FS_CHECK_FS is unset, we don't need to add any debugging overhead.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Chao Yu 74de593af7 f2fs: read contiguous sit entry pages by merging for mount performance
Previously we read sit entry pages one by one, which lost the chance
of reading contiguous pages together. So we now read pages as contiguously as
possible for better mount performance.

change log:
 o merge judgements/use 'Continue' or 'Break' instead of 'Goto' as Gu Zheng
   suggested.
 o add mark_page_accessed() before release page to delay VM reclaiming.
 o remove '*order' for simplification of function as Jaegeuk Kim suggested.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix a bug on the block address calculation]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Chao Yu d4d288bc72 f2fs: adds a tracepoint for f2fs_submit_read_bio
This patch adds a tracepoint for f2fs_submit_read_bio.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: integrate tracepoints of f2fs_submit_read(_write)_bio]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Chao Yu 87b8872d5b f2fs: adds a tracepoint for submit_read_page
This patch adds a tracepoint for submit_read_page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: integrate tracepoints of f2fs_submit_read(_write)_page]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Changman Lee 61ae45c880 f2fs: simplify IS_DATASEG and IS_NODESEG macro
It is not efficient comparing each segment type to find node or data.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: remove unnecessary white spaces]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Jaegeuk Kim 7107e0a9b1 f2fs: merge read IOs at ra_nat_pages()
Change log from v1:
  o add mark_page_accessed() not to reclaim the nat pages.

This patch changes the policy of submitting read bios at ra_nat_pages.

Previously, f2fs submits small read bios with block plugging.
But, with this patch, f2fs itself merges read bios first and then submits a
large bio, which can reduce the bio handling overheads.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Chao Yu 924b720b58 f2fs: add a new function to support for merging contiguous read
For better read performance, we add a new function to support for merging
contiguous read as the one for write.

v1-->v2:
 o add declarations here as Gu Zheng suggested.
 o use new structure f2fs_bio_info introduced by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Acked-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
2013-12-23 10:18:02 +09:00
Gu Zheng ce3b7d80ed f2fs: move the list_head initialization into the lock protection region
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Gu Zheng 502c6e0bcd f2fs: simplify write_orphan_inodes for better readable
Simplify write_orphan_inodes for better readability. Because we hold the
orphan_inode_mutex, it's safe to use list_for_each_entry instead of
list_for_each_safe.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Gu Zheng ef86d70994 f2fs: convert inc/dec_valid_node_count to inc/dec one count
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Gu Zheng da19b0dc50 f2fs: convert dev_valid_block_count to void
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Gu Zheng 58e674d6ab f2fs: convert remove_inode_page to void
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim 1ff7bd3bb5 f2fs: introduce a bio array for per-page write bios
The f2fs has three bio types, NODE, DATA, and META, and manages some data
structures per each bio types.

The codes are a little bit messy, thus, this patch introduces a bio array
which groups individual data structures as follows.

struct f2fs_bio_info {
	struct bio *bio;		/* bios to merge */
	sector_t last_block_in_bio;	/* last block number */
	struct mutex io_mutex;		/* mutex for bio */
};

struct f2fs_sb_info {
	...
	struct f2fs_bio_info write_io[NR_PAGE_TYPE];	/* for write bios */
	...
};

The code changes from this new data structure are trivial.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim c11abd1a80 f2fs: disable the extent cache ops on high fragmented files
The f2fs manages an extent cache to search a number of consecutive data blocks
very quickly.

However it conducts unnecessary cache operations if the file is highly
fragmented with no valid extent cache.

In such a case, we don't need to handle the extent cache and can simply disable
the cache facility.

Nevertheless, this patch gives one more chance to enable the extent cache.

For example,
1. create a file
2. write data sequentially which produces a large valid extent cache
3. update some data, resulting in a fragmented extent
4. if the fragmented extent is too small, then drop extent cache
5. close the file

6. open the file again
7. give another chance to make a new extent cache
8. write data sequentially again which creates another big extent cache.
...

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim 971767caf6 f2fs: use sbi->write_mutex for write bios
This patch removes an unnecessary semaphore (i.e., sbi->bio_sem).
There is no reason to use the semaphore when f2fs submits read and write IOs.
Instead, let's use a write mutex and cover the sbi->bio[] by the lock.

Change log from v1:
 o split write_mutex suggested by Chao Yu

Chao described,
"All DATA/NODE/META bio buffers in superblock is protected by
'sbi->write_mutex', but each bio buffer area is independent, So we
should split write_mutex to three for DATA/NODE/META."

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim 7d5e510944 f2fs: clean up the do_submit_bio flow
This patch introduces PAGE_TYPE_OF_BIO() and cleans up do_submit_bio() with it.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Chao Yu 75c3c8bc88 f2fs: use f2fs_put_page to release page for uniform style
We should use f2fs_put_page to release page for uniform style of f2fs code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim 1661d07c2d f2fs: add a tracepoint for f2fs_issue_discard
This patch adds a tracepoint for f2fs_issue_discard.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim 3720887910 f2fs: introduce f2fs_issue_discard() to clean up
Change log from v1:
 o fix 32bit drops reported by Dan Carpenter

This patch adds f2fs_issue_discard() to clean up blkdev_issue_discard() flows.

Dan Carpenter reported:
"block_t is a 32 bit type and sector_t is a 64 bit type.  The upper 32
bits of the sector_t are not used because the shift will wrap."

Bug-Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim 7ac8c3b051 f2fs: add a sysfs entry to control max_discards
If frequent small discards are issued to the device, the performance would
be degraded significantly.
So, this patch adds a sysfs entry to control the number of discards to be
issued during a checkpoint procedure.

By default, f2fs does not issue any small discards, which means max_discards
is zero.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim b29555505d f2fs: add key functions for small discards
This patch adds key functions to activate the small discard feature.

Note that this procedure is conducted during the checkpoint only.

In flush_sit_entries(), when a new dirty sit entry is flushed, f2fs calls
add_discard_addrs() which searches candidates to be discarded.
Each candidate must be marked *invalidated* now while the previous checkpoint
still recognizes it as *valid*.

At the end of a checkpoint procedure, f2fs throws discards based on the
discard entry list.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim 7fd9e544fb f2fs: add a slab cache entry for small discards
This patch adds a slab cache entry for small discards.

Each entry consists of:

struct discard_entry {
	struct list_head list;	/* list head */
	block_t blkaddr;	/* block address to be discarded */
	int len;		/* # of consecutive blocks of the discard */
};

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Changman Lee e81c93cf8c f2fs: improve searching speed of __next_free_blkoff
Finding a zero bit in the result of an OR operation between ckpt_valid_map
and cur_valid_map is faster than finding a zero bit in each bitmap.
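
A rough user-space sketch of the idea, assuming byte-array bitmaps with LSB-first bit order (an assumption of this sketch, not f2fs's on-disk bit order) and a hypothetical function name: OR the two maps once per byte and search the merged value, skipping bytes that are completely full.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the first offset >= start that is zero in both maps,
 * or nbits if there is none. nbits is assumed to be a multiple of 8.
 */
static size_t ex_next_free_blkoff(const uint8_t *ckpt_valid_map,
				  const uint8_t *cur_valid_map,
				  size_t nbits, size_t start)
{
	size_t i = start;

	while (i < nbits) {
		uint8_t merged = ckpt_valid_map[i / 8] | cur_valid_map[i / 8];

		if (merged == 0xff) {
			/* Whole byte in use: jump to the next byte. */
			i = (i / 8 + 1) * 8;
			continue;
		}
		if (!(merged & (1u << (i % 8))))
			return i;
		i++;
	}
	return nbits;
}
```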

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: adjust changed function name]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:17:59 +09:00
Changman Lee 9a7f143ab5 f2fs: introduce __find_rev_next(_zero)_bit
When f2fs_set_bit is used, the MSB and LSB within a byte are reversed;
in that case we can use __find_rev_next_bit or __find_rev_next_zero_bit.
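
To illustrate what the reversed bit order means, here is a simple bit-at-a-time sketch in user-space C. The ex_ function names are hypothetical, and this is not the kernel's implementation; it only shows that bit 0 of the map lives in the most significant bit of byte 0, so a search has to walk bits in that order.

```c
#include <stddef.h>
#include <stdint.h>

/* Set bit nr with MSB-first ordering inside each byte. */
static void ex_rev_set_bit(size_t nr, uint8_t *map)
{
	map[nr >> 3] |= (uint8_t)(1u << (7 - (nr & 7)));
}

/* Test bit nr with the same MSB-first ordering. */
static int ex_rev_test_bit(size_t nr, const uint8_t *map)
{
	return (map[nr >> 3] >> (7 - (nr & 7))) & 1;
}

/* First set bit at or after start, or nbits if none. */
static size_t ex_rev_find_next_bit(const uint8_t *map, size_t nbits, size_t start)
{
	for (size_t i = start; i < nbits; i++)
		if (ex_rev_test_bit(i, map))
			return i;
	return nbits;
}

/* First clear bit at or after start, or nbits if none. */
static size_t ex_rev_find_next_zero_bit(const uint8_t *map, size_t nbits, size_t start)
{
	for (size_t i = start; i < nbits; i++)
		if (!ex_rev_test_bit(i, map))
			return i;
	return nbits;
}
```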

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: change the function names]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:17:59 +09:00
Kent Overstreet 4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
2013-11-23 22:33:47 -08:00
Kent Overstreet 2c30c71bd6 block: Convert various code to bio_for_each_segment()
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-23 22:33:46 -08:00
Changman Lee 29e59c14ae f2fs: issue more large discard command
o Changes from v1
  Use find_next(_zero)_bit suggested by jg.kim

When f2fs issues a discard command and segments are contiguous,
let's gather the adjacent segments and issue one larger discard.
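
The gathering step can be sketched in user-space C as follows. This is an illustration under assumptions: the bitmap is a plain byte array with LSB-first bit order, and the struct and function names are hypothetical. Each run of consecutive set bits becomes one (start, len) range instead of one discard per segment.

```c
#include <stddef.h>
#include <stdint.h>

struct ex_discard_range {
	size_t start;	/* first segment of the run */
	size_t len;	/* number of consecutive segments */
};

static int ex_test_bit(const uint8_t *map, size_t nr)
{
	return (map[nr / 8] >> (nr % 8)) & 1;
}

/* Collect runs of set bits; returns the number of ranges written (<= max). */
static size_t ex_collect_discard_ranges(const uint8_t *map, size_t nsegs,
					struct ex_discard_range *out, size_t max)
{
	size_t n = 0, i = 0;

	while (i < nsegs && n < max) {
		/* Skip clear bits (the role find_next_bit plays). */
		while (i < nsegs && !ex_test_bit(map, i))
			i++;
		if (i >= nsegs)
			break;
		size_t start = i;
		/* Extend over set bits (the role find_next_zero_bit plays). */
		while (i < nsegs && ex_test_bit(map, i))
			i++;
		out[n].start = start;
		out[n].len = i - start;
		n++;
	}
	return n;
}
```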

** blktrace **
179,1    0     5859    42.619023770   971  C   D 131072 + 2097152 [0]
179,1    0    33665   108.840475468   971  C   D 2228224 + 2494464 [0]
179,1    0    33671   109.131616427   971  C   D 14909440 + 344064 [0]
179,1    0    33677   109.137100677   971  C   D 15261696 + 4096 [0]

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-11 09:36:32 +09:00
Chao Yu 1d15bd2034 f2fs: fix memory leak after kobject init failed in fill_super
If we fail to init and add the kobject in fill_super, the stats info and proc
object of f2fs will not be released.
We should free them before we finish fill_super.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-08 14:10:29 +09:00
Changman Lee fb51b5ef9c f2fs: cleanup waiting routine for writeback pages in cp
Use the general method supported by the kernel.

 o changes from v1
   If any waiter exists at end io, wake it up.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-08 14:10:29 +09:00
Chao Yu 3b03f72445 f2fs: avoid to use a NULL point in destroy_segment_manager
Avoid using a NULL pointer in destroy_segment_manager after memory allocation
fails for f2fs_sm_info.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-06 16:37:44 +09:00
Chao Yu 4bf08ff6f9 f2fs: remove unnecessary TestClearPageError when wait pages writeback
In wait_on_node_pages_writeback we test and clear the error flag for all
pages in the radix tree, which is not necessary.
So we only do this for pages belonging to the specified inode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-04 12:24:01 +09:00
Jaegeuk Kim cfe58f9dcd f2fs: avoid to wait all the node blocks during fsync
Previously, f2fs_sync_file() waited for all the node blocks to be written.
But we don't need to do that; we only need to wait for the inode-related node
blocks.

This patch adds wait_on_node_pages_writeback(), which waits for the
inode-related node blocks that are under writeback.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-31 16:01:03 +09:00
Chao Yu 44c60bf2b9 f2fs: check all ones or zeros bitmap with bitops for better mount performance
Previously, check_block_count checked valid_map bit by bit even in the common
scenario where a sit entry has an all-ones or all-zeros bitmap, which lowered
mount performance.
So let's check these special bitmaps with an integer data type instead of bit by bit.
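
As a hedged illustration of checking with an integer data type (a user-space sketch with hypothetical names, not the patch's code): compare the bitmap one 64-bit word at a time and only fall back to bytes for the tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum ex_bitmap_kind { EX_ALL_ZEROS, EX_ALL_ONES, EX_MIXED };

/* Classify a bitmap using word-sized comparisons where possible. */
static enum ex_bitmap_kind ex_classify_bitmap(const uint8_t *map, size_t nbytes)
{
	int all_zeros = 1, all_ones = 1;
	size_t i = 0;

	for (; i + sizeof(uint64_t) <= nbytes; i += sizeof(uint64_t)) {
		uint64_t w;

		memcpy(&w, map + i, sizeof(w));	/* avoids unaligned access */
		if (w != 0)
			all_zeros = 0;
		if (w != UINT64_MAX)
			all_ones = 0;
	}
	for (; i < nbytes; i++) {	/* leftover tail bytes */
		if (map[i] != 0x00)
			all_zeros = 0;
		if (map[i] != 0xff)
			all_ones = 0;
	}
	if (all_zeros)
		return EX_ALL_ZEROS;
	if (all_ones)
		return EX_ALL_ONES;
	return EX_MIXED;
}
```

Only a bitmap classified as mixed would then need the slower per-bit or per-run counting.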

v1-->v2:
 o use find_next_{zero_}bit_le for better performance and readability as Jaegeuk
   suggested.
 o use neat logogram in comment as Gu Zheng suggested.
 o search continuous ones or zeros for better performance when checking mixed
   bitmap.

Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-30 12:23:23 +09:00
Fan Li 9a47938b22 f2fs: change the method of calculating the number summary blocks
npages_for_summary_flush uses (SUMMARY_SIZE + 1) as the size of a f2fs_summary
while its actual size is SUMMARY_SIZE. So the result is sometimes bigger than the
actual number by one, which means the checkpoint can't be written to disk
contiguously, and sometimes summary blocks can't be compacted as they should be.
Besides, when writing summary blocks into pages, if the remaining space in a page
isn't big enough for one f2fs_summary, it is left unused; the current code
does not seem to take this into account.

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-30 12:17:58 +09:00
Chao Yu cc3de6a3ac f2fs: fix calculating incorrect free size when update xattr in __f2fs_setxattr
During an xattr update, the free size should be corrected to the remaining free
size + the old entry size.
This avoids an ENOSPC error when we replace an old entry with a new entry of the
same size in a fully filled xattr.
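
The corrected check can be sketched like this. The function and parameter names are hypothetical and the real __f2fs_setxattr logic is more involved; the point is only that the space of the entry being replaced counts as free.

```c
#include <stddef.h>

/*
 * Return 1 if the new entry fits, 0 if the caller should report ENOSPC.
 * old_entry_size is 0 when no existing entry is being replaced.
 */
static int ex_xattr_update_fits(size_t remaining_free, size_t old_entry_size,
				size_t new_entry_size)
{
	size_t free_size = remaining_free + old_entry_size;

	return new_entry_size <= free_size;
}
```

With a full xattr area (remaining_free of 0), replacing a 32-byte entry with another 32-byte entry now fits, whereas a check against remaining_free alone would fail.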

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-29 15:56:08 +09:00
Jaegeuk Kim 5d56b6718a f2fs: add an option to avoid unnecessary BUG_ONs
If you want to remove unnecessary BUG_ONs, you can just turn off F2FS_CHECK_FS
in your kernel config.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-29 15:44:38 +09:00
Jaegeuk Kim 3b218e3a21 f2fs: introduce CONFIG_F2FS_CHECK_FS for BUG_ON control
This config will support an option to remove so many BUG_ONs that degrade
the performance potentially.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-29 15:43:01 +09:00
Jaegeuk Kim 2ed2d5b33c f2fs: fix a deadlock during init_acl procedure
The deadlock is found through the following scenario.

sys_mkdir()
 -> f2fs_add_link()
  -> __f2fs_add_link()
   -> init_inode_metadata()
     : lock_page(inode);
    -> f2fs_init_acl()
     -> f2fs_set_acl()
      -> f2fs_setxattr(..., NULL)
       : This NULL page incurs a deadlock at update_inode_page().

So, like f2fs_init_security(), this patch adds a parameter to transfer the
locked inode page to f2fs_setxattr().

Found by Linux File System Verification project (linuxtesting.org).

Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-28 13:39:09 +09:00
Jaegeuk Kim b8b60e1a65 f2fs: clean up acl flow for better readability
This patch cleans up a couple of acl codes.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-28 13:38:21 +09:00
Changman Lee 4625d6aac2 f2fs: remove unnecessary segment bitmap updates
Only one dirty type is set in __locate_dirty_segment and we can know
dirty type of segment. So we don't need to check other dirty types.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-28 13:38:16 +09:00
Jaegeuk Kim e943a10d94 f2fs: add tracepoint for vm_page_mkwrite
This patch adds a tracepoint for f2fs_vm_page_mkwrite.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:40 +09:00
Jaegeuk Kim 26c6b88799 f2fs: add tracepoint for set_page_dirty
This patch adds a tracepoint for set_page_dirty.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:40 +09:00
Chao Yu e8d61a7488 f2fs: remove redundant set_page_dirty from write_compacted_summaries
Previously, set_page_dirty was called every time after writing one summary info
into the compacted summary page.
To avoid redundant set_page_dirty calls, we only call set_page_dirty before
releasing the page.

Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:39 +09:00
Jaegeuk Kim ea91e9b043 f2fs: add reclaiming control by sysfs
This patch adds a control method in sysfs to reclaim prefree segments.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:39 +09:00
Jaegeuk Kim 4660f9c0fe f2fs: introduce f2fs_balance_fs_bg for some background jobs
This patch merges some background jobs into this new function.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:38 +09:00
Jaegeuk Kim 81eb8d6e28 f2fs: reclaim prefree segments periodically
Previously, f2fs postpones reclaiming prefree segments into free segments
as much as possible.
However, if user writes and deletes a bunch of data without any sync or fsync
calls, some flash storages can suffer from garbage collections.

So, this patch adds the reclaiming codes to f2fs_write_node_pages and background
GC thread.

If there are a lot of prefree segments, let's do checkpoint so that f2fs
submits discard commands for the prefree regions to the flash storage.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:37 +09:00
Haicheng Li aabe51364f f2fs: use bool for booleans
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:37 +09:00
Jaegeuk Kim dcdfff6527 f2fs: clean up several status-related operations
This patch cleans up improper definitions that update some status information.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-25 16:54:08 +09:00
Gu Zheng 7bd59381c8 f2fs: introduce f2fs_kmem_cache_alloc to hide the unfailed, kmem cache allocation
Introduce the unfailed version of kmem_cache_alloc named f2fs_kmem_cache_alloc
to hide the retry routine and make the code a bit cleaner.

v2:
   Fix the wrong use of 'retry' tag pointed out by Gao feng.
   Use neater code to remove the redundant tag, as suggested by Haicheng Li.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-22 20:16:02 +09:00
Haicheng Li 435f2a1b58 f2fs: no need to check other dirty_segmap when the seg has been found
Because one dirty seg can only be mapped to one dirty_type. Otherwise, it's a bug.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
[Jaegeuk Kim: modify a comment related to this patch]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-22 19:57:31 +09:00
Haicheng Li cffbfa6648 f2fs: use true and false for boolean value
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-22 19:49:39 +09:00
Jaegeuk Kim 87a9bd2656 f2fs: avoid to write during the recovery
This patch enhances the recovery routine not to write any data/node/meta until
its completion.
If any writes are sent to the disk, it could contaminate the written history
that will be used for further recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-18 09:44:14 +09:00
Gu Zheng e234088758 f2fs: avoid wait if IO end up when do_checkpoint for better performance
Previously, do_checkpoint() called congestion_wait() to wait for the previously
submitted node/meta/data pages to be written back.
Because congestion_wait() waits for a fixed period (e.g. HZ / 50) and there was
no additional wake-up mechanism when IO finished before the period elapsed,
Yuan Zhong found a situation in which the pages had already been written back
but the checkpoint thread was still waiting for congestion_wait() to exit.

So here we store the checkpoint task in f2fs_sb when doing a checkpoint. It waits
for IO completion if there is IO in flight, and the end-IO path wakes up the
checkpoint task when the IO finishes.

Thanks to Yuan Zhong's pre work about this problem.

Reported-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-18 09:44:14 +09:00
Gu Zheng 9076a75f8e f2fs: introduce function read_raw_super_block()
Introduce function read_raw_super_block() to hide reading raw super block and
the retry routine if the first sb is invalid.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-18 09:44:13 +09:00
Jaegeuk Kim b1838f8952 f2fs: fix the starvation problem on cp_rwsem
This patch removes the logic previously introduced to address the starvation
on cp_rwsem.

One potential bug therein is that we should cover the wait.list with spin_lock,
but the previous code broke this rule.

And, actually, the current rwsem handles this starvation issue reasonably, so we
didn't need to do this before either.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-18 09:44:13 +09:00
Jaegeuk Kim 3d1e38073b f2fs: fix to store and retrieve i_rdev correctly
When storing i_rdev, we should check its file type.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-18 09:43:38 +09:00
Jaegeuk Kim ccaaca2591 f2fs: fix writing incorrect orphan blocks
Previously, there was an erroneous scenario like below.
thread 1:                       thread 2:
 f2fs_unlink
  - acquire_orphan_inode
    : sbi->n_orphans++           write_checkpoint
                                 - block_operations
                                  : f2fs_lock_all
                                 - do_checkpoint
                                  : write orphan blocks with sbi->n_orphans
                                 - unblock_operations
  - f2fs_lock_op
  - release_orphan_inode
  - f2fs_unlock_op

During the checkpoint by thread 2, f2fs stores a wrong orphan block according
to the wrong sbi->n_orphans.
To avoid this, we should simply cover acquire_orphan_inode with f2fs_lock_op
as well.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-08 10:19:28 +09:00
Jaegeuk Kim 5887d291d7 f2fs: avoid unnecessary checkpoints
During the f2fs_put_super procedure, we don't need to conduct checkpoint all
the time, since we don't need to do that if superblock is clean.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-08 09:32:43 +09:00
Kelly Anderson 4058c5117d f2fs: handle remount options correctly
The current f2fs code errors if the xattr or acl options are passed when
remounting.  This is important in a typical scenario where f2fs is mounted
as a "ro" root file-system by the boot loader and then the init process wants
to remount it "rw" with the "remount,rw" option.

Signed-off-by: Kelly Anderson <kelly@xilka.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-07 11:38:13 +09:00
Gu Zheng e479556bfd f2fs: use rw_sem instead of fs_lock(locks mutex)
The fs_locks are used to block other ops (e.g., recovery) while a checkpoint is in progress,
and every other operation (besides checkpoint) needs to acquire an fs_lock.
This causes a serious problem: if too many threads try to acquire an fs_lock concurrently,
they block each other, which may hurt performance. That is not the behavior we want.
Although some optimization patches have been introduced to improve the usage of fs_lock,
the thorough solution is to replace the fs_lock with a *rw_sem*.
The checkpoint routine takes write_sem and other ops take read_sem, so we can still block
other ops (e.g., recovery) while doing a checkpoint, and other ops no longer disturb each other.
This avoids the problem described above completely.
Because of a weakness of rw_sem, the above change may introduce a potential problem:
the checkpoint thread might get starved if other threads are intensively locking
the read semaphore for I/O. (Pointed out by Xu Jin)
To avoid this, a wait_list is introduced: pending read semaphore ops are put on the
wait_list if the checkpoint thread is waiting for the write semaphore, and they are
woken up when the checkpoint thread releases the write semaphore.
Thanks to Kim for the previous review and testing. I would be very glad to see others'
performance tests of this patch.

V2:
  -fix the potential starvation problem.
  -use more suitable func name suggested by Xu Jin.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: adjust minor coding standard]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-10-07 11:33:05 +09:00
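
The locking scheme described in e479556bfd above can be sketched as a toy, single-threaded state machine. It is only a model of the semantics (shared mode for ordinary ops, exclusive mode for checkpoint, new readers parked while the checkpoint is waiting); it is not thread-safe, and every name in it is hypothetical rather than taken from the patch.

```c
#include <stdbool.h>

/* Toy, single-threaded model; field names are illustrative only. */
struct toy_cp_lock {
    int  readers;         /* ops currently holding the lock in shared mode */
    bool writer;          /* checkpoint currently holds it exclusively */
    bool writer_waiting;  /* checkpoint has asked for the lock */
    int  parked;          /* ops parked on the wait list */
};

/* Ordinary op asks for shared access: true if it may proceed right away. */
bool toy_lock_op(struct toy_cp_lock *l)
{
    if (l->writer || l->writer_waiting) {
        l->parked++;      /* defer behind the checkpoint */
        return false;
    }
    l->readers++;
    return true;
}

void toy_unlock_op(struct toy_cp_lock *l)
{
    l->readers--;
}

/* Checkpoint asks for exclusive access: true once it owns the lock. */
bool toy_lock_all(struct toy_cp_lock *l)
{
    l->writer_waiting = true;
    if (l->readers > 0)
        return false;     /* still draining readers; ask again later */
    l->writer_waiting = false;
    l->writer = true;
    return true;
}

/* Checkpoint releases the lock; returns how many parked ops were woken. */
int toy_unlock_all(struct toy_cp_lock *l)
{
    int woken = l->parked;

    l->writer = false;
    l->parked = 0;
    return woken;
}
```

In this model a reader that arrives after toy_lock_all() has been requested is parked even though the lock is still only read-held, which is the behaviour that keeps the checkpoint from starving.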
Russ W. Knize 2e5558f4a5 f2fs: account for orphan inodes during recovery
During recovery, orphan inodes are deleted via truncate_hole().
These orphans are added by recover_dentry() via f2fs_delete_entry().
However, f2fs_delete_entry() adds them via add_orphan_inode()
without calling acquire_orphan_inode() first.  This prevents the
counters from being incremented properly, which causes them to
underflow when remove_orphan_inode() is called later on.

Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:59:32 +09:00
Russ Knize 52ab956000 f2fs: don't GC or take an fs_lock from f2fs_initxattrs()
f2fs_initxattrs() is called internally from within F2FS and should
not call functions that are used by VFS handlers.  This avoids
certain deadlocks:

- vfs_create()
 - f2fs_create() <-- takes an fs_lock
  - f2fs_add_link()
   - __f2fs_add_link()
    - init_inode_metadata()
     - f2fs_init_security()
      - security_inode_init_security()
       - f2fs_initxattrs()
        - f2fs_setxattr() <-- also takes an fs_lock

If the caller happens to grab the same fs_lock from the pool in both
places, they will deadlock.  There are also deadlocks involving
multiple threads and mutexes:

- f2fs_write_begin()
 - f2fs_balance_fs() <-- takes gc_mutex
  - f2fs_gc()
   - write_checkpoint()
    - block_operations()
     - mutex_lock_all() <-- blocks trying to grab all fs_locks

- f2fs_mkdir() <-- takes an fs_lock
 - __f2fs_add_link()
  - f2fs_init_security()
   - security_inode_init_security()
    - f2fs_initxattrs()
     - f2fs_setxattr()
      - f2fs_balance_fs() <-- blocks trying to take gc_mutex

Signed-off-by: Russ Knize <Russ.Knize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:51:24 +09:00
Russ W. Knize 885166c03c f2fs: don't let the orphan inode counter underflow
Accounting errors from buggy code calling the acquire/release/remove
orphan inode interfaces can cause n_orphans to underflow, which will
then cause acquire_orphan_inode() to return -ENOSPC on the next
operation.  This commit guards against that condition.

Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:49:12 +09:00
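
The failure mode described in 885166c03c above can be reproduced with a small userspace sketch. The limit, the struct and the function names are hypothetical; the sketch only assumes that the counter is unsigned and that acquiring fails once the counter reaches a maximum, as the message implies.

```c
#include <errno.h>

#define TOY_MAX_ORPHANS 4u   /* hypothetical limit, chosen only for the demo */

struct toy_sbi {
    unsigned int n_orphans;
};

/* Reserve a slot; fails once the counter has reached the limit. */
int toy_acquire_orphan_inode(struct toy_sbi *sbi)
{
    if (sbi->n_orphans >= TOY_MAX_ORPHANS)
        return -ENOSPC;
    sbi->n_orphans++;
    return 0;
}

/* Unguarded release: an unmatched call wraps 0 around to a huge value. */
void toy_release_unguarded(struct toy_sbi *sbi)
{
    sbi->n_orphans--;
}

/* Guarded release: an unmatched call is ignored instead of wrapping. */
void toy_release_guarded(struct toy_sbi *sbi)
{
    if (sbi->n_orphans > 0)
        sbi->n_orphans--;
}
```

After one unmatched toy_release_unguarded() the counter is far above the limit, so the next acquire reports -ENOSPC even though no orphan exists. The guarded variant leaves the counter at zero and the next acquire succeeds.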
Chao Yu 691c6fd2a2 f2fs: remove unneeded write checkpoint in recover_fsync_data
Previously, recover_fsync_data still wrote a checkpoint even when there was
nothing to recover on a normally unmounted image.
This may reduce mount performance and flash memory lifetime, so let's remove
it.

Signed-off-by: Tan Shu <shu.tan@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-25 17:18:16 +09:00
Chao Yu cc7b1bb173 f2fs: avoid allocating failure in bio_alloc
This patch adds the macro MAX_BIO_BLOCKS to limit the value of npages in
f2fs_bio_alloc. This avoids the allocation failure in bio_alloc that occurs
when npages is larger than BIO_MAX_PAGES.

Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-24 17:45:48 +09:00
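
The clamp described in cc7b1bb173 above amounts to taking a minimum. The following is a minimal sketch, not the macro from the patch: the function name is hypothetical and the ceiling of 256 is an assumed value standing in for BIO_MAX_PAGES.

```c
/* Assumed ceiling standing in for BIO_MAX_PAGES; the real value may differ. */
#define TOY_BIO_MAX_PAGES 256

/* Never ask the bio allocator for more pages than it can provide at once. */
int toy_max_bio_blocks(int npages)
{
    return npages < TOY_BIO_MAX_PAGES ? npages : TOY_BIO_MAX_PAGES;
}
```

A request for 4096 pages is reduced to 256 here, so the caller has to submit the remainder in further bios rather than fail the allocation.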
Jin Xu a57e564d14 f2fs: optimize the victim searching loop slightly
Since MAX_VICTIM_SEARCH has been enlarged from 20 to 4096,
the victim searching overhead is much higher than before,
especially for SSR, which searches for a victim quite often.
This patch intends to reduce the overhead a little bit by:
- making get_gc_cost an inline routine to reduce function call
  overhead
- reducing multiplication and division operations
- reducing unnecessary comparison operations

Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-24 17:45:48 +09:00
Yu Chao e76eebee70 f2fs: optimize fs_lock for better performance
There is a performance problem: when all sbi->fs_lock are held,
all the following threads may get the same next_lock value from sbi->next_lock_num
in function mutex_lock_op and wait for the same lock (fs_lock[next_lock]),
which may reduce performance.
So we move the sbi->next_lock_num++ before getting the lock; this spreads the
following threads across the locks when all sbi->fs_lock are held.

v1-->v2:
	Drop the needless spin_lock as Jaegeuk suggested.

Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-24 17:45:48 +09:00
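
The reordering in e76eebee70 above can be illustrated by looking only at how a caller picks its lock index. This sketch leaves out the locks themselves; the pool size and all names are hypothetical, and the increment is deliberately left unsynchronized, in line with the v2 note about dropping the spin_lock.

```c
#define TOY_NR_LOCKS 8   /* hypothetical pool size */

struct toy_sbi_locks {
    unsigned int next_lock_num;
};

/* Before: the index is advanced only after the lock has been obtained, so
 * every caller that arrives while the pool is busy reads the same value. */
unsigned int toy_pick_before(const struct toy_sbi_locks *s)
{
    return s->next_lock_num % TOY_NR_LOCKS;
}

/* After: advance first, so consecutive callers queue on different locks.
 * The increment is not atomic in this sketch. */
unsigned int toy_pick_after(struct toy_sbi_locks *s)
{
    return s->next_lock_num++ % TOY_NR_LOCKS;
}
```

Three callers using toy_pick_before() while nobody makes progress all land on the same index, whereas toy_pick_after() hands out three consecutive indices.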
Jin Xu a26b7c8a01 f2fs: optimize gc for better performance
This patch improves the gc efficiency by optimizing the victim
selection policy. With this optimization, the random re-write
performance could increase up to 20%.

For f2fs, when the disk is short of free space, gc selects
dirty segments and moves valid blocks around to make more space
available. The gc cost of a segment is determined by the number of valid
blocks in the segment: the fewer the valid blocks, the higher the efficiency.
The ideal victim segment is the one that has the most garbage blocks.

Currently, it searches up to 20 dirty segments for a victim segment.
The selected victim is unlikely to be the best victim for gc when there
are many more dirty segments. Why not search more dirty segments
for a better victim? The cost of searching dirty segments is
negligible in comparison to moving blocks.

This patch enlarges MAX_VICTIM_SEARCH to 4096 to make the search
more aggressive in looking for a possibly better victim. Since
it also applies to victim selection for SSR, it will likely improve
the SSR efficiency as well.

The test case is simple. It creates files until the disk is full.
The size of each file is 32KB. Then it writes 100000 records of
4KB each to random offsets of random files in sync mode.
The testing was done on a 2GB partition of an SDHC card. Let's see the
test result of f2fs without and with the patch.

---------------------------------------
2GB partition, SDHC
create 52023 files of size 32768 bytes
random re-write 100000 records of 4KB
---------------------------------------
           | file creation (s) | rewrite time (s) | gc count | gc garbage blocks |
[no patch] |               341 |             4227 |     1174 |            174840 |
[patched]  |               324 |             2958 |      645 |            106682 |

It's obvious that, with the patch, f2fs finishes the test in 20+% less
time than without the patch. Internally, it also does much less gc, with
higher efficiency than before.

Since the performance improvement is related to gc, it might not be so
obvious for other tests that do not trigger gc as often as this one
(this is because f2fs selects dirty segments for SSR use most of the
time when free space is short). The well-known iozone test tool
was not used for benchmarking the patch because it does not seem to have
a test case that performs random re-write on a full disk.

This patch is the revised version based on the suggestion from
Jaegeuk Kim.

Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: suggested simpler solution]
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-05 13:50:32 +09:00
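
The effect of the search window in a26b7c8a01 above can be sketched with a simple greedy scan. This is an illustration under assumptions, not the f2fs victim selection code: the function name is hypothetical, the cost of a segment is taken to be just its valid block count, and the scan always starts at the first dirty segment.

```c
#include <stddef.h>

/* Return the index of the dirty segment with the fewest valid blocks among
 * the first max_search candidates, or -1 if there are no candidates. */
int toy_pick_victim(const unsigned int *valid_blocks, size_t nr_dirty,
                    size_t max_search)
{
    size_t limit = nr_dirty < max_search ? nr_dirty : max_search;
    int best = -1;

    for (size_t i = 0; i < limit; i++) {
        if (best < 0 || valid_blocks[i] < valid_blocks[best])
            best = (int)i;
    }
    return best;
}
```

With valid block counts of 400, 350, 500, 10 and 300, a window of 3 settles for the segment holding 350 valid blocks, while a window of 4096 reaches the segment holding only 10, so far fewer blocks have to be moved.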
Jaegeuk Kim 423e95ccbe f2fs: merge more bios of node block writes
Previously, we saw the following bio traces when running a simple sequential
write test.

 f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500104928, size = 4K
 f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499922208, size = 368K
 f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499914752, size = 140K

 -> total 512K

The first one is to write an indirect node block, and the others are to write
direct node blocks.

The reason why there are two separate bios for direct node blocks is:
0. initial state
------------------    ------------------
|                |    |xxxxxxxx        |
------------------    ------------------

1. write 368K
------------------    ------------------
|                |    |xxxxxxxxWWWWWWWW|
------------------    ------------------

2. write 140K
------------------    ------------------
|WWWWWWW         |    |xxxxxxxxWWWWWWWW|
------------------    ------------------

This is because f2fs_write_node_pages tries to write just 512K in total, so
we can lose the chance to merge more bios nicely.

After this patch is applied, we can get the following bio traces.

  f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500103168, size = 8K
  f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500111368, size = 4K
  f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500107272, size = 512K
  f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500108296, size = 512K
  f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500109320, size = 500K

And finally, we can improve the sequential write performance,
    from 458.775 MB/s to 479.945 MB/s on SSD.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-05 10:17:19 +09:00
Jaegeuk Kim 222cbdc483 f2fs: avoid an overflow during utilization calculation
The current f2fs stores all the block counts as 32-bit numbers, which can
cover a volume of about 15TB.

But when calculating utilization, f2fs multiplies the count by 100, which can
cause an overflow.
This patch fixes this.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-03 13:41:37 +09:00
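
The overflow described in 222cbdc483 above is easy to reproduce in userspace. The two functions below are hypothetical and only contrast a 32-bit multiplication with one that is widened to 64 bits first; how the kernel patch itself performs the division is not shown here.

```c
#include <stdint.h>

/* Buggy form: the multiplication is done in 32 bits and can wrap around. */
uint32_t toy_utilization_32(uint32_t valid_blocks, uint32_t total_blocks)
{
    return valid_blocks * 100u / total_blocks;
}

/* Fixed form: widen before multiplying so the product always fits. */
uint32_t toy_utilization_64(uint32_t valid_blocks, uint32_t total_blocks)
{
    return (uint32_t)((uint64_t)valid_blocks * 100u / total_blocks);
}
```

For 3,000,000,000 valid blocks out of 4,000,000,000 the true utilization is 75%, but the 32-bit product wraps to 3,647,256,576 and the buggy form reports 0%. For small counts both forms agree.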
Jaegeuk Kim c34e333fd5 f2fs: trigger GC when there are prefree segments
Previously, f2fs conducts SSR when free_sections() < overprovision_sections.
But, even though there are a lot of prefree segments, it can consider SSR only.
So, let's consider the number of prefree segments too for triggering SSR.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-09-03 10:11:20 +09:00
Gu Zheng 749ebfd174 f2fs: use strncasecmp() simplify the string comparison
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-27 21:50:12 +09:00
Jaegeuk Kim 8cb8268809 f2fs: fix omitting to update inode page
The f2fs_set_link updates its parent inode number, so we should sync this to
the inode block.
Otherwise, the data can be lost after a sudden power-off.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-27 21:49:04 +09:00
Jaegeuk Kim 65985d935d f2fs: support the inline xattrs
0. modified inode structure
--------------------------------------
metadata (e.g., i_mtime, i_ctime, etc)
--------------------------------------
direct pointers [0 ~ 873]

inline xattrs (200 bytes by default)

indirect pointers [0 ~ 4]
--------------------------------------
node footer
--------------------------------------

1. setxattr flow
 - read_all_xattrs copies all the xattrs from inline and xattr node block.
 - handle xattr entries
 - write_all_xattrs copies modified xattrs into inline and xattr node block.

2. getxattr flow
 - read_all_xattrs copies all the xattrs from inline and xattr node block.
 - check target entries

3. Usage
 # mount -t f2fs -o inline_xattr $DEV $MNT

 Once mounted with the inline_xattr option, f2fs marks all the newly created
 files to reserve an amount of inline xattr space explicitly inside the inode
 block. Without the mount option, f2fs touches neither existing files nor
 newly created ones.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:23 +09:00
Jaegeuk Kim 4f16fb0f9b f2fs: add the truncate_xattr_node function
The truncate_xattr_node function will be used by inline xattr.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:06 +09:00
Jaegeuk Kim dd9cfe236f f2fs: introduce __find_xattr for readability
__find_xattr searches for the wanted xattr entry, starting from
base_addr.

If it is not found, the returned entry is the last, empty xattr entry, which
can be newly allocated.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:06 +09:00
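
The contract described in dd9cfe236f above (return the match, otherwise the trailing empty entry) can be sketched with a simplified table. The entry layout here is hypothetical: fixed-size entries with a zero name_index as terminator, whereas the real on-disk entries are variable-length.

```c
#include <string.h>

/* Simplified, fixed-size entry; not the on-disk f2fs layout. */
struct toy_xattr_entry {
    unsigned char name_index;  /* namespace id; 0 marks the terminator here */
    char name[16];
    char value[16];
};

/* Walk from base until a match or the terminating empty entry is reached.
 * The caller tells the two apart by checking whether name_index is 0. */
struct toy_xattr_entry *toy_find_xattr(struct toy_xattr_entry *base,
                                       unsigned char name_index,
                                       const char *name)
{
    struct toy_xattr_entry *e = base;

    while (e->name_index != 0) {
        if (e->name_index == name_index && strcmp(e->name, name) == 0)
            break;
        e++;
    }
    return e;
}
```

Returning the terminator on a miss lets a set operation reuse the same pointer as the place where a new entry would be written.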
Jaegeuk Kim de93653fe3 f2fs: reserve the xattr space dynamically
This patch enables the number of direct pointers inside on-disk inode block to
be changed dynamically according to the size of inline xattr space.

The number of direct pointers, ADDRS_PER_INODE, can be changed only if the file
has inline xattr flag.

The number of direct pointers that will be used by inline xattrs is defined as
F2FS_INLINE_XATTR_ADDRS.
The current patch sets F2FS_INLINE_XATTR_ADDRS to 0 temporarily.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:01 +09:00
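
The arithmetic behind de93653fe3 above can be sketched as follows. The constants are assumptions rather than values stated in this commit: 923 default direct pointers per inode block, 4-byte block addresses, and 50 reserved slots, which corresponds to the 200-byte default mentioned for the inline xattr feature (this patch itself still uses 0). All names are hypothetical.

```c
#include <stdbool.h>

/* Assumed values; see the note above. */
#define TOY_DEF_ADDRS_PER_INODE 923  /* direct pointers without inline xattrs */
#define TOY_INLINE_XATTR_ADDRS  50   /* pointer slots given up for xattrs */
#define TOY_ADDR_SIZE           4    /* bytes per block address */

/* Direct pointers available to a file, depending on its inline xattr flag. */
int toy_addrs_per_inode(bool has_inline_xattr)
{
    if (has_inline_xattr)
        return TOY_DEF_ADDRS_PER_INODE - TOY_INLINE_XATTR_ADDRS;
    return TOY_DEF_ADDRS_PER_INODE;
}

/* Bytes of inline xattr space gained by giving up those pointer slots. */
int toy_inline_xattr_bytes(void)
{
    return TOY_INLINE_XATTR_ADDRS * TOY_ADDR_SIZE;
}
```

Under these assumptions a flagged file keeps 873 direct pointers and gains 200 bytes of inline xattr space, while an unflagged file keeps all 923 pointers.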
Jaegeuk Kim 444c580f7e f2fs: add flags for inline xattrs
This patch adds a basic inode flag for inline xattrs, F2FS_INLINE_XATTR,
and adds a mount option, inline_xattr, which is enabled when xattr is set.

If the mount option is enabled, all the files are marked with the inline_xattrs
flag.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:02:12 +09:00
Wei Yongjun 6e6b978c32 f2fs: fix error return code in init_f2fs_fs()
Fix to return -ENOMEM in the kset create and add error handling
case instead of 0, as done elsewhere in this function.

Introduced by commit b59d0bae6c.
(f2fs: add sysfs support for controlling the gc_thread)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
[Jaegeuk Kim: merge the patch with previous modification]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 19:36:46 +09:00
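
The bug class fixed in 6e6b978c32 above is a failure path that jumps to cleanup while the error variable still holds 0. The sketch below shows the corrected shape in userspace; toy_create_kset() and toy_init_fs() are hypothetical stand-ins, not the kernel functions.

```c
#include <errno.h>
#include <stddef.h>

static int toy_kset_storage;

/* Hypothetical stand-in for the kset creation step; NULL means failure. */
void *toy_create_kset(int should_fail)
{
    return should_fail ? NULL : &toy_kset_storage;
}

int toy_init_fs(int kset_should_fail)
{
    int err = 0;  /* earlier steps succeeded, so err is 0 at this point */
    void *kset;

    kset = toy_create_kset(kset_should_fail);
    if (!kset) {
        err = -ENOMEM;  /* without this assignment the function returns 0 */
        goto fail;
    }
fail:
    return err;
}
```

If the assignment is missing, a caller sees success even though the kset was never created, which is why the commit sets the code explicitly before jumping to the cleanup label.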