linux

Commit Graph

Author	SHA1	Message	Date
Chris Mason	56bec294de	Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Chris Mason	b4ce94de9b	Btrfs: Change btree locking to use explicit blocking points Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:25:08 -05:00
Chris Mason	d397712bcc	Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 21:25:51 -05:00
Josef Bacik	2517920135	Btrfs: nuke fs wide allocation mutex V2 This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2008-10-29 14:49:05 -04:00
Chris Mason	d352ac6814	Btrfs: add and improve comments This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-29 15:18:18 -04:00
Chris Mason	e02119d5a7	Btrfs: Add a write ahead tree log to optimize synchronous operations File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	3f157a2fd2	Btrfs: Online btree defragmentation fixes The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	1b1e2135dc	Btrfs: Add a per-inode csum mutex to avoid races creating csum items Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	e7a84565bc	Btrfs: Add btree locking to the tree defragmentation code The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	925baeddc5	Btrfs: Start btree concurrency work. The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	0ef3e66b67	Btrfs: Allocator fix variety pack * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	1259ab75c6	Btrfs: Handle write errors on raid1 and raid10 When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes to are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	ca7a79ad8d	Btrfs: Pass down the expected generation number when reading tree blocks Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	0999df54f8	Btrfs: Verify checksums on tree blocks found without read_tree_block Checksums were only verified by btrfs_read_tree_block, which meant the functions to probe the page cache for blocks were not validating checksums. Normally this is fine because the buffers will only be in cache if they have already been validated. But, there is a window while the buffer is being read from disk where it could be up to date in the cache but not yet verified. This patch makes sure all buffers go through checksum verification before they are used. This is safer, and it prevents modification of buffers before they go through the csum code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	9afbb0b752	Btrfs: Disable tree defrag in SSD mode Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:00 -04:00
Chris Mason	b3236e68bf	Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:00 -04:00
Chris Mason	e18e4809b1	Btrfs: Add mount -o ssd, which includes optimizations for seek free storage Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:59 -04:00
Chris Mason	7bb86316c3	Btrfs: Add back pointers from extents to the btree or file referencing them Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:58 -04:00
Chris Mason	f84a8b362d	Btrfs: Optimize allocations as we need to mix data and metadata into one group Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:57 -04:00
Chris Mason	081e95736d	Btrfs: Make defrag check nodes against the progress key to prevent repeating work Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:57 -04:00
Chris Mason	5708b95916	Btrfs: Tune the automatic defrag code 1) Forced defrag wasn't working properly (btrfsctl -d) because some cache only checks were incorrect. 2) Defrag only the leaves unless in forced defrag mode. 3) Don't use complex logic to figure out if a leaf is needs defrag Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:57 -04:00
Chris Mason	a6b6e75e09	Btrfs: Defrag only leaves, and only when the parent node has a single objectid This allows us to defrag huge directories, but skip the expensive defrag case in more common usage, where it does not help as much. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:57 -04:00
Chris Mason	cf786e79e3	Btrfs: Defrag: only walk into nodes with the defrag bit set Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:56 -04:00
Chris Mason	0f1ebbd159	Btrfs: Large block related defrag optimizations Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:56 -04:00
Chris Mason	0f82731fc5	Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:56 -04:00
Chris Mason	6b80053d02	Btrfs: Add back the online defragging code Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:56 -04:00
Chris Mason	db94535db7	Btrfs: Allow tree blocks larger than the page size Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:56 -04:00
Chris Mason	5f39d397df	Btrfs: Create extent_buffer interface for large blocksizes Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:03:56 -04:00
Chris Mason	86479a04ee	Add support for defragging files via btrfsctl -d. Avoid OOM on extent tree defrag. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2007-09-10 19:58:16 -04:00
Chris Mason	f2183bde1a	Btrfs: Add BH_Defrag to mark buffers that are in need of defragging This allows the tree walking code to defrag only the newly allocated buffers, it seems to be a good balance between perfect defragging and the performance hit of repeatedly reallocating blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2007-08-10 14:42:37 -04:00
Chris Mason	e9d0b13b5b	Btrfs: Btree defrag on the extent-mapping tree as well Signed-off-by: Chris Mason <chris.mason@oracle.com>	2007-08-10 14:06:19 -04:00
Chris Mason	409eb95d7f	Btrfs: Further reduce the concurrency penalty of defrag and drop_snapshot Signed-off-by: Chris Mason <chris.mason@oracle.com>	2007-08-08 20:17:12 -04:00
Chris Mason	6702ed490c	Btrfs: Add run time btree defrag, and an ioctl to force btree defrag This adds two types of btree defrag, a run time form that tries to defrag recently allocated blocks in the btree when they are still in ram, and an ioctl that forces defrag of all btree blocks. File data blocks are not defragged yet, but this can make a huge difference in sequential btree reads. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2007-08-07 16:15:09 -04:00

33 Commits