o Changes from v1
Use find_next(_zero)_bit suggested by jg.kim
When f2fs issues discard command, if segment is contiguous,
let's issue more large segment to gather adjacent segments.
** blktrace **
179,1 0 5859 42.619023770 971 C D 131072 + 2097152 [0]
179,1 0 33665 108.840475468 971 C D 2228224 + 2494464 [0]
179,1 0 33671 109.131616427 971 C D 14909440 + 344064 [0]
179,1 0 33677 109.137100677 971 C D 15261696 + 4096 [0]
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If we failed to init&add kobject when fill_super, stats info and proc object of
f2fs will not be released.
We should free them before we finish fill_super.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
use genernal method supported by kernel
o changes from v1
If any waiter exists at end io, wake up it.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
A NULL point should avoid to be used in destroy_segment_manager after allocating
memory fail for f2fs_sm_info.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In wait_on_node_pages_writeback we will test and clear error flag for all
pages in radix tree, but not necessary.
So we only do this for pages belong to the specified inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs_sync_file() waits for all the node blocks to be written.
But, we don't need to do that, but wait only the inode-related node blocks.
This patch adds wait_on_node_pages_writeback() in which waits inode-related
node blocks that are on writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, check_block_count check valid_map with bit data type in common
scenario that sit has all ones or zeros bitmap, it makes low mount performance.
So let's check the special bitmap with integer data type instead of the bit one.
v1-->v2:
o use find_next_{zero_}bit_le for better performance and readable as Jaegeuk
suggested.
o use neat logogram in comment as Gu Zheng suggested.
o search continuous ones or zeros for better performance when checking mixed
bitmap.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
npages_for_summary_flush uses (SUMMARY_SIZE + 1) as the size of a f2fs_summary
while its actual size is SUMMARY_SIZE. So the result sometimes is bigger than
actual number by one, which causes checkpoint can't be written into disk
contiguously, and sometimes summary blocks can't be compacted like they should.
Besides, when writing summary blocks into pages, if remain space in a page
isn't big enough for one f2fs_summary, it will be left unused, current code
seems not to take it into account.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During xattr updating, free size should be corrected to remainder free size
+ old entry size.
It can avoid ENOSPC error when we update old entry with the same size new
entry at fully filled xattr.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If you want to remove unnecessary BUG_ONs, you can just turn off F2FS_CHECK_FS
in your kernel config.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This config will support an option to remove so many BUG_ONs that degrade
the performance potentially.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The deadlock is found through the following scenario.
sys_mkdir()
-> f2fs_add_link()
-> __f2fs_add_link()
-> init_inode_metadata()
: lock_page(inode);
-> f2fs_init_acl()
-> f2fs_set_acl()
-> f2fs_setxattr(..., NULL)
: This NULL page incurs a deadlock at update_inode_page().
So, likewise f2fs_init_security(), this patch adds a parameter to transfer the
locked inode page to f2fs_setxattr().
Found by Linux File System Verification project (linuxtesting.org).
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Only one dirty type is set in __locate_dirty_segment and we can know
dirty type of segment. So we don't need to check other dirty types.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, set_page_dirty is called every time after writting one summary info
into compacted summary page,
To avoid redundant set_page_dirty, we only call set_page_dirty before release
page.
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a control method in sysfs to reclaim prefree segments.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch merges some background jobs into this new function.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs postpones reclaiming prefree segments into free segments
as much as possible.
However, if user writes and deletes a bunch of data without any sync or fsync
calls, some flash storages can suffer from garbage collections.
So, this patch adds the reclaiming codes to f2fs_write_node_pages and background
GC thread.
If there are a lot of prefree segments, let's do checkpoint so that f2fs
submits discard commands for the prefree regions to the flash storage.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce the unfailed version of kmem_cache_alloc named f2fs_kmem_cache_alloc
to hide the retry routine and make the code a bit cleaner.
v2:
Fix the wrong use of 'retry' tag pointed out by Gao feng.
Use more neat code to remove redundant tag suggested by Haicheng Li.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Because one dirty seg can only be mapped to one dirty_type. Otherwise, it's a bug.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
[Jaegeuk Kim: modify a comment related to this patch]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enhances the recovery routine not to write any data/node/meta until
its completion.
If any writes are sent to the disk, it could contaminate the written history
that will be used for further recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, do_checkpoint() will call congestion_wait() for waiting the pages
(previous submitted node/meta/data pages) to be written back.
Because congestion_wait() will set a regular period (e.g. HZ / 50 ) for waiting, and
no additional wake up mechanism was introduced if IO ends up before regular period costed.
Yuan Zhong found there is a situation that after the pages have been written back,
but the checkpoint thread still wait for congestion_wait to exit.
So here we store checkpoint task into f2fs_sb when doing checkpoint, it'll wait for IO completes
if there's IO going on, and in the end IO path, wake up checkpoint task when IO ends up.
Thanks to Yuan Zhong's pre work about this problem.
Reported-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce function read_raw_super_block() to hide reading raw super block and
the retry routine if the first sb is invalid.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes the logic previously introduced to address the starvation
on cp_rwsem.
One potential there-in bug is that we should cover the wait.list with spin_lock,
but the previous code broke this rule.
And, actually current rwsem handles this starvation issue reasonably, so that we
didn't need to do this before neither.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, there was a erroneous scenario like below.
thread 1: thread 2:
f2fs_unlink
- acquire_orphan_inode
: sbi->n_orphans++ write_checkpoint
- block_operations
: f2fs_lock_all
- do_checkpoint
: write orphan blocks with sbi->n_orphans
- unblock_operations
- f2fs_lock_op
- release_orphan_inode
- f2fs_unlock_op
During the checkpoint by thread 2, f2fs stores a wrong orphan block according
to the wrong sbi->n_orphans.
To avoid this, simply we should make cover acquire_orphan_inode too with
f2fs_lock_op.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During the f2fs_put_super procedure, we don't need to conduct checkpoint all
the time, since we don't need to do that if superblock is clean.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The current f2fs code errors if the xattr or acl options are passed when
remounting. This is important in a typical scenario where f2fs is mounted
as a "ro" root file-system by the boot loader and then the init process wants
to remount it "rw" with the "remount,rw" option.
Signed-off-by: Kelly Anderson <kelly@xilka.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The fs_locks is used to block other ops(ex, recovery) when doing checkpoint.
And each other operate routine(besides checkpoint) needs to acquire a fs_lock,
there is a terrible problem here, if these are too many concurrency threads acquiring
fs_lock, so that they will block each other and may lead to some performance problem,
but this is not the phenomenon we want to see.
Though there are some optimization patches introduced to enhance the usage of fs_lock,
but the thorough solution is using a *rw_sem* to replace the fs_lock.
Checkpoint routine takes write_sem, and other ops take read_sem, so that we can block
other ops(ex, recovery) when doing checkpoint, and other ops will not disturb each other,
this can avoid the problem described above completely.
Because of the weakness of rw_sem, the above change may introduce a potential problem
that the checkpoint thread might get starved if other threads are intensively locking
the read semaphore for I/O.(Pointed out by Xu Jin)
In order to avoid this, a wait_list is introduced, the appending read semaphore ops
will be dropped into the wait_list if checkpoint thread is waiting for write semaphore,
and will be waked up when checkpoint thread gives up write semaphore.
Thanks to Kim's previous review and test, and will be very glad to see other guys'
performance tests about this patch.
V2:
-fix the potential starvation problem.
-use more suitable func name suggested by Xu Jin.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: adjust minor coding standard]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During recovery, orphan inodes are deleted via truncate_hole().
These orphans are added by recover_dentry() via f2fs_delete_entry().
However, f2fs_delete_entry() adds them via add_orphan_inode()
without calling acquire_orphan_inode() first. This prevents the
counters from being incremented properly, which causes them to
underflow when remove_orphan_inode() is called later on.
Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
f2fs_initxattrs() is called internally from within F2FS and should
not call functions that are used by VFS handlers. This avoids
certain deadlocks:
- vfs_create()
- f2fs_create() <-- takes an fs_lock
- f2fs_add_link()
- __f2fs_add_link()
- init_inode_metadata()
- f2fs_init_security()
- security_inode_init_security()
- f2fs_initxattrs()
- f2fs_setxattr() <-- also takes an fs_lock
If the caller happens to grab the same fs_lock from the pool in both
places, they will deadlock. There are also deadlocks involving
multiple threads and mutexes:
- f2fs_write_begin()
- f2fs_balance_fs() <-- takes gc_mutex
- f2fs_gc()
- write_checkpoint()
- block_operations()
- mutex_lock_all() <-- blocks trying to grab all fs_locks
- f2fs_mkdir() <-- takes an fs_lock
- __f2fs_add_link()
- f2fs_init_security()
- security_inode_init_security()
- f2fs_initxattrs()
- f2fs_setxattr()
- f2fs_balance_fs() <-- blocks trying to take gc_mutex
Signed-off-by: Russ Knize <Russ.Knize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Accounting errors from buggy code calling the acquire/release/remove
orphan inode interfaces can cause n_orphans to underflow, which will
then cause acquire_orphan_inode() to return -ENOSPC on the next
operation. This commit guards against that condition.
Signed-off-by: Russ Knize <rknize@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, recover_fsync_data still to write checkpoint when there is
nothing to recover with normal umount image.
It may reduce mount performance and flash memory lifetime, so let's remove
it.
Signed-off-by: Tan Shu <shu.tan@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch add macro MAX_BIO_BLOCKS to limit value of npages in
f2fs_bio_alloc, it can avoid allocating failure in bio_alloc caused by
npages is larger than BIO_MAX_PAGES.
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Since the MAX_VICTIM_SEARCH has been enlarged from 20 to 4096,
the victim searching overhead will be increased much than before,
especially for SSR that searches victim for use quiet often.
This patch intends to reduce the overhead a little bit by:
- make the get_gc_cost a inline routine to reduce function call
overhead
- reduce multiplication and division operations
- reduce unnecessary comparison operation
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
There is a performance problem: when all sbi->fs_lock are holded, then
all the following threads may get the same next_lock value from sbi->next_lock_num
in function mutex_lock_op, and wait for the same lock(fs_lock[next_lock]),
it may cause performance reduce.
So we move the sbi->next_lock_num++ before getting lock, this will average the
following threads if all sbi->fs_lock are holded.
v1-->v2:
Drop the needless spin_lock as Jaegeuk suggested.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch improves the gc efficiency by optimizing the victim
selection policy. With this optimization, the random re-write
performance could increase up to 20%.
For f2fs, when disk is in shortage of free spaces, gc will selects
dirty segments and moves valid blocks around for making more space
available. The gc cost of a segment is determined by the valid blocks
in the segment. The less the valid blocks, the higher the efficiency.
The ideal victim segment is the one that has the most garbage blocks.
Currently, it searches up to 20 dirty segments for a victim segment.
The selected victim is not likely the best victim for gc when there
are much more dirty segments. Why not searching more dirty segments
for a better victim? The cost of searching dirty segments is
negligible in comparison to moving blocks.
In this patch, it enlarges the MAX_VICTIM_SEARCH to 4096 to make
the search more aggressively for a possible better victim. Since
it also applies to victim selection for SSR, it will likely improve
the SSR efficiency as well.
The test case is simple. It creates as many files until the disk full.
The size for each file is 32KB. Then it writes as many as 100000
records of 4KB size to random offsets of random files in sync mode.
The testing was done on a 2GB partition of a SDHC card. Let's see the
test result of f2fs without and with the patch.
---------------------------------------
2GB partition, SDHC
create 52023 files of size 32768 bytes
random re-write 100000 records of 4KB
---------------------------------------
| file creation (s) | rewrite time (s) | gc count | gc garbage blocks |
[no patch] 341 4227 1174 174840
[patched] 324 2958 645 106682
It's obvious that, with the patch, f2fs finishes the test in 20+% less
time than without the patch. And internally it does much less gc with
higher efficiency than before.
Since the performance improvement is related to gc, it might not be so
obvious for other tests that do not trigger gc as often as this one (
This is because f2fs selects dirty segments for SSR use most of the
time when free space is in shortage). The well-known iozone test tool
was not used for benchmarking the patch becuase it seems do not have
a test case that performs random re-write on a full disk.
This patch is the revised version based on the suggestion from
Jaegeuk Kim.
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: suggested simpler solution]
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, we experience bio traces as follows when running simple sequential
write test.
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500104928, size = 4K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499922208, size = 368K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499914752, size = 140K
-> total 512K
The first one is to write an indirect node block, and the others are to write
direct node blocks.
The reason why there are two separate bios for direct node blocks is:
0. initial state
------------------ ------------------
| | |xxxxxxxx |
------------------ ------------------
1. write 368K
------------------ ------------------
| | |xxxxxxxxWWWWWWWW|
------------------ ------------------
2. write 140K
------------------ ------------------
|WWWWWWW | |xxxxxxxxWWWWWWWW|
------------------ ------------------
This is because f2fs_write_node_pages tries to write just 512K totally, so that
we can lose the chance to merge more bios nicely.
After this patch is applied, we can get the following bio traces.
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500103168, size = 8K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500111368, size = 4K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500107272, size = 512K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500108296, size = 512K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500109320, size = 500K
And finally, we can improve the sequential write performance,
from 458.775 MB/s to 479.945 MB/s on SSD.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The current f2fs uses all the block counts with 32 bit numbers, which is able to
cover about 15TB volume.
But in calculation of utilization, f2fs multiplies the count by 100 which can
induce overflow.
This patch fixes this.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs conducts SSR when free_sections() < overprovision_sections.
But, even though there are a lot of prefree segments, it can consider SSR only.
So, let's consider the number of prefree segments too for triggering SSR.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs_set_link updates its parent inode number, so we should sync this to
the inode block.
Otherwise, the data can be lost after sudden-power-off.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
0. modified inode structure
--------------------------------------
metadata (e.g., i_mtime, i_ctime, etc)
--------------------------------------
direct pointers [0 ~ 873]
inline xattrs (200 bytes by default)
indirect pointers [0 ~ 4]
--------------------------------------
node footer
--------------------------------------
1. setxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- handle xattr entries
- write_all_xattrs copies modified xattrs into inline and xattr node block.
2. getxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- check target entries
3. Usage
# mount -t f2fs -o inline_xattr $DEV $MNT
Once mounted with the inline_xattr option, f2fs marks all the newly created
files to reserve an amount of inline xattr space explicitly inside the inode
block. Without the mount option, f2fs will not touch any existing files and
newly created files as well.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The __find_xattr is to search the wanted xattr entry starting from the
base_addr.
If not found, the returned entry is the last empty xattr entry that can be
allocated newly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>