commit 31b2212019 upstream.
The dm-writecache reads metadata in the target constructor. However, when
we reload the target, there could be another active instance running on
the same device. This is the sequence of operations when doing a reload:
1. construct new target
2. suspend old target
3. resume new target
4. destroy old target
Metadata that were written by the old target between steps 1 and 2 would
not be visible by the new target.
Fix the data corruption by loading the metadata in the resume handler.
Also, validate block_size is at least as large as both the devices'
logical block size and only read 1 block from the metadata during
target constructor -- no need to read entirety of metadata now that it
is done during resume.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ad4e80a639 upstream.
The error correction data is computed as if data and hash blocks
were concatenated. But hash block number starts from v->hash_start.
So, we have to calculate hash block number based on that.
Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org
Signed-off-by: Sunwook Eom <speed.eom@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9fc06ff568 ]
Add missing casts when converting from regions to sectors.
In case BITS_PER_LONG == 32, the lack of the appropriate casts can lead
to overflows and miscalculation of the device sector.
As a result, we could end up discarding and/or copying the wrong parts
of the device, thus corrupting the device's data.
Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4b5142905d ]
There is a bug in the way dm-clone handles discards, which can lead to
discarding the wrong blocks or trying to discard blocks beyond the end
of the device.
This could lead to data corruption, if the destination device indeed
discards the underlying blocks, i.e., if the discard operation results
in the original contents of a block to be lost.
The root of the problem is the code that calculates the range of regions
covered by a discard request and decides which regions to discard.
Since dm-clone handles the device in units of regions, we don't discard
parts of a region, only whole regions.
The range is calculated as:
rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
re = bio_end_sector(bio) >> clone->region_shift;
, where 'rs' is the first region to discard and (re - rs) is the number
of regions to discard.
The bug manifests when we try to discard part of a single region, i.e.,
when we try to discard a block with size < region_size, and the discard
request both starts at an offset with respect to the beginning of that
region and ends before the end of the region.
The root cause is the following comparison:
if (rs == re)
// skip discard and complete original bio immediately
, which doesn't take into account that 'rs' might be greater than 're'.
Thus, we then issue a discard request for the wrong blocks, instead of
skipping the discard all together.
Fix the check to also take into account the above case, so we don't end
up discarding the wrong blocks.
Also, add some range checks to dm_clone_set_region_hydrated() and
dm_clone_cond_set_range(), which update dm-clone's region bitmap.
Note that the aforementioned bug doesn't cause invalid memory accesses,
because dm_clone_is_range_hydrated() returns True for this case, so the
checks are just precautionary.
Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6ca43ed837 ]
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.
spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b8fdd09037 ]
zmd->nr_rnd_zones was increased twice by mistake. The other place it
is increased in dmz_init_zone() is the only one needed:
1131 zmd->nr_useable_zones++;
1132 if (dmz_is_rnd(zone)) {
1133 zmd->nr_rnd_zones++;
^^^
Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 81d5553d12 upstream.
dm_clone_nr_of_hydrated_regions() returns the number of regions that
have been hydrated so far. In order to do so it employs bitmap_weight().
Until now, the return type of dm_clone_nr_of_hydrated_regions() was
unsigned long.
Because bitmap_weight() returns an int, in case BITS_PER_LONG == 64 and
the return value of bitmap_weight() is 2^31 (the maximum allowed number
of regions for a device), the result is sign extended from 32 bits to 64
bits and an incorrect value is displayed, in the status output of
dm-clone, as the number of hydrated regions.
Fix this by having dm_clone_nr_of_hydrated_regions() return an unsigned
int.
Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cd481c1226 upstream.
Add overflow check for clone->nr_regions variable, which holds the
number of regions of the target.
The overflow can occur with sufficiently large devices, if BITS_PER_LONG
== 32. E.g., if the region size is 8 sectors (4K), the overflow would
occur for device sizes > 34359738360 sectors (~16TB).
This could result in multiple device sectors wrongly mapping to the same
region number, due to the truncation from 64 bits to 32 bits, which
would lead to data corruption.
Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b93b6643e9 upstream.
If the user specifies tag size larger than HASH_MAX_DIGESTSIZE,
there's a crash in integrity_metadata().
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1edaa447d9 upstream.
Initializing a dm-writecache device can take a long time when the
persistent memory device is large. Add cond_resched() to a few loops
to avoid warnings that the CPU is stuck.
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6b40bec3b1 ]
Don't call quiesce(1) and quiesce(0) if array is already suspended,
otherwise in level_store, the array is writable after mddev_detach
in below part though the intention is to make array writable after
resume.
mddev_suspend(mddev);
mddev_detach(mddev);
...
mddev_resume(mddev);
And it also causes calltrace as follows in [1].
[48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
[...]
[48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
[48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
[48005.653984] RIP: 0010:kthread_park+0x77/0x90
[48005.654015] Call Trace:
[48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
[48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
[48005.654073] mddev_detach+0x30/0x70 [md_mod]
[48005.654090] level_store+0x202/0x670 [md_mod]
[48005.654099] ? security_capable+0x40/0x60
[48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
[48005.654123] kernfs_fop_write+0xce/0x1b0
[48005.654132] vfs_write+0xb6/0x1a0
[48005.654138] ksys_write+0x67/0xe0
[48005.654146] do_syscall_64+0x4e/0x140
[48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[48005.654161] RIP: 0033:0x7fa0c8737497
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 120c9257f5 upstream.
This reverts commit effd58c95f.
blk_queue_split() is causing excessive IO splitting -- because
blk_max_size_offset() depends on 'chunk_sectors' limit being set and
if it isn't (as is the case for DM targets!) it falls back to
splitting on a 'max_sectors' boundary regardless of offset.
"Fix" this by reverting back to _not_ using blk_queue_split() in
dm_process_bio() for normal IO (reads and writes). Long-term fix is
still TBD but it should focus on training blk_max_size_offset() to
call into a DM provided hook (to call DM's max_io_len()).
Test results from simple misaligned IO test on 4-way dm-striped device
with chunksize of 128K and stripesize of 512K:
xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev
before this revert:
253,0 21 1 0.000000000 2206 Q R 224 + 4072 [xfs_io]
253,0 21 2 0.000008267 2206 X R 224 / 480 [xfs_io]
253,0 21 3 0.000010530 2206 X R 224 / 256 [xfs_io]
253,0 21 4 0.000027022 2206 X R 480 / 736 [xfs_io]
253,0 21 5 0.000028751 2206 X R 480 / 512 [xfs_io]
253,0 21 6 0.000033323 2206 X R 736 / 992 [xfs_io]
253,0 21 7 0.000035130 2206 X R 736 / 768 [xfs_io]
253,0 21 8 0.000039146 2206 X R 992 / 1248 [xfs_io]
253,0 21 9 0.000040734 2206 X R 992 / 1024 [xfs_io]
253,0 21 10 0.000044694 2206 X R 1248 / 1504 [xfs_io]
253,0 21 11 0.000046422 2206 X R 1248 / 1280 [xfs_io]
253,0 21 12 0.000050376 2206 X R 1504 / 1760 [xfs_io]
253,0 21 13 0.000051974 2206 X R 1504 / 1536 [xfs_io]
253,0 21 14 0.000055881 2206 X R 1760 / 2016 [xfs_io]
253,0 21 15 0.000057462 2206 X R 1760 / 1792 [xfs_io]
253,0 21 16 0.000060999 2206 X R 2016 / 2272 [xfs_io]
253,0 21 17 0.000062489 2206 X R 2016 / 2048 [xfs_io]
253,0 21 18 0.000066133 2206 X R 2272 / 2528 [xfs_io]
253,0 21 19 0.000067507 2206 X R 2272 / 2304 [xfs_io]
253,0 21 20 0.000071136 2206 X R 2528 / 2784 [xfs_io]
253,0 21 21 0.000072764 2206 X R 2528 / 2560 [xfs_io]
253,0 21 22 0.000076185 2206 X R 2784 / 3040 [xfs_io]
253,0 21 23 0.000077486 2206 X R 2784 / 2816 [xfs_io]
253,0 21 24 0.000080885 2206 X R 3040 / 3296 [xfs_io]
253,0 21 25 0.000082316 2206 X R 3040 / 3072 [xfs_io]
253,0 21 26 0.000085788 2206 X R 3296 / 3552 [xfs_io]
253,0 21 27 0.000087096 2206 X R 3296 / 3328 [xfs_io]
253,0 21 28 0.000093469 2206 X R 3552 / 3808 [xfs_io]
253,0 21 29 0.000095186 2206 X R 3552 / 3584 [xfs_io]
253,0 21 30 0.000099228 2206 X R 3808 / 4064 [xfs_io]
253,0 21 31 0.000101062 2206 X R 3808 / 3840 [xfs_io]
253,0 21 32 0.000104956 2206 X R 4064 / 4096 [xfs_io]
253,0 21 33 0.001138823 0 C R 4096 + 200 [0]
after this revert:
253,0 18 1 0.000000000 4430 Q R 224 + 3896 [xfs_io]
253,0 18 2 0.000018359 4430 X R 224 / 256 [xfs_io]
253,0 18 3 0.000028898 4430 X R 256 / 512 [xfs_io]
253,0 18 4 0.000033535 4430 X R 512 / 768 [xfs_io]
253,0 18 5 0.000065684 4430 X R 768 / 1024 [xfs_io]
253,0 18 6 0.000091695 4430 X R 1024 / 1280 [xfs_io]
253,0 18 7 0.000098494 4430 X R 1280 / 1536 [xfs_io]
253,0 18 8 0.000114069 4430 X R 1536 / 1792 [xfs_io]
253,0 18 9 0.000129483 4430 X R 1792 / 2048 [xfs_io]
253,0 18 10 0.000136759 4430 X R 2048 / 2304 [xfs_io]
253,0 18 11 0.000152412 4430 X R 2304 / 2560 [xfs_io]
253,0 18 12 0.000160758 4430 X R 2560 / 2816 [xfs_io]
253,0 18 13 0.000183385 4430 X R 2816 / 3072 [xfs_io]
253,0 18 14 0.000190797 4430 X R 3072 / 3328 [xfs_io]
253,0 18 15 0.000197667 4430 X R 3328 / 3584 [xfs_io]
253,0 18 16 0.000218751 4430 X R 3584 / 3840 [xfs_io]
253,0 18 17 0.000226005 4430 X R 3840 / 4096 [xfs_io]
253,0 18 18 0.000250404 4430 Q R 4120 + 176 [xfs_io]
253,0 18 19 0.000847708 0 C R 4096 + 24 [0]
253,0 18 20 0.000855783 0 C R 4120 + 176 [0]
Fixes: effd58c95f ("dm: always call blk_queue_split() in dm_process_bio()")
Cc: stable@vger.kernel.org
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 248aa2645a ]
In cases where dec_in_flight() has to requeue the integrity_bio_wait
work to transfer the rest of the data, the bio's __bi_remaining might
already have been decremented to 0, e.g.: if bio passed to underlying
data device was split via blk_queue_split().
Use dm_bio_{record,restore} rather than effectively open-coding them in
dm-integrity -- these methods now manage __bi_remaining too.
Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
Reported-by: Daniel Glöckner <dg@emlix.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1b17159e52 ]
Also, save/restore __bi_remaining in case the bio was used in a
BIO_CHAIN (e.g. due to blk_queue_split).
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 974f51e863 upstream.
We neither assign congested_fn for requested-based blk-mq device nor
implement it correctly. So fix both.
Also, remove incorrect comment from dm_init_normal_md_queue and rename
it to dm_init_congested_fn.
Fixes: 4aa9c692e0 ("bdi: separate out congested state into a separate struct")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ee63634bae upstream.
Dm-zoned initializes reference counters of new chunk works with zero
value and refcount_inc() is called to increment the counter. However, the
refcount_inc() function handles the addition to zero value as an error
and triggers the warning as follows:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 1506 at lib/refcount.c:25 refcount_warn_saturate+0x68/0xf0
...
CPU: 7 PID: 1506 Comm: systemd-udevd Not tainted 5.4.0+ #134
...
Call Trace:
dmz_map+0x2d2/0x350 [dm_zoned]
__map_bio+0x42/0x1a0
__split_and_process_non_flush+0x14a/0x1b0
__split_and_process_bio+0x83/0x240
? kmem_cache_alloc+0x165/0x220
dm_process_bio+0x90/0x230
? generic_make_request_checks+0x2e7/0x680
dm_make_request+0x3e/0xb0
generic_make_request+0xcf/0x320
? memcg_drain_all_list_lrus+0x1c0/0x1c0
submit_bio+0x3c/0x160
? guard_bio_eod+0x2c/0x130
mpage_readpages+0x182/0x1d0
? bdev_evict_inode+0xf0/0xf0
read_pages+0x6b/0x1b0
__do_page_cache_readahead+0x1ba/0x1d0
force_page_cache_readahead+0x93/0x100
generic_file_read_iter+0x83a/0xe40
? __seccomp_filter+0x7b/0x670
new_sync_read+0x12a/0x1c0
vfs_read+0x9d/0x150
ksys_read+0x5f/0xe0
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
After this warning, following refcount API calls for the counter all fail
to change the counter value.
Fix this by setting the initial reference counter value not zero but one
for the new chunk works. Instead, do not call refcount_inc() via
dmz_get_chunk_work() for the new chunks works.
The failure was observed with linux version 5.4 with CONFIG_REFCOUNT_FULL
enabled. Refcount rework was merged to linux version 5.5 by the
commit 168829ad09 ("Merge branch 'locking-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"). After this
commit, CONFIG_REFCOUNT_FULL was removed and the failure was observed
regardless of kernel configuration.
Linux version 4.20 merged the commit 092b564876 ("dm zoned: target: use
refcount_t for dm zoned reference counters"). Before this commit, dm
zoned used atomic_t APIs which does not check addition to zero, then this
fix is not necessary.
Fixes: 092b564876 ("dm zoned: target: use refcount_t for dm zoned reference counters")
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 41c526c5af upstream.
Verify the watermark upon resume - so that if the target is reloaded
with lower watermark, it will start the cleanup process immediately.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit adc0daad36 upstream.
The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.
An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress. As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries
writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.
We must use drain_workqueue - that waits until the work and all re-queued
works finish.
As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.
With this change use of drain_workqueue() can be used effectively. This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
the are no regressions.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7cdf6a0aae upstream.
The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
while :; do make check T=shell/lvconvert-thin-external-cache.sh; done
The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.
The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.
Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.
Fixes: c6b4fcbad0 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7fc2e47f40 upstream.
If the flag SB_FLAG_RECALCULATE is present in the superblock, but it was
not specified on the command line (i.e. ic->recalculate_flag is false),
dm-integrity would return invalid table line - the reported number of
arguments would not match the real number.
Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 53770f0ec5 upstream.
If we need to perform synchronous I/O in dm_integrity_map_continue(),
we must make sure that we are not in the map function - in order to
avoid the deadlock due to bio queuing in generic_make_request. To
avoid the deadlock, we offload the request to metadata_wq.
However, metadata_wq also processes metadata updates for write requests.
If there are too many requests that get offloaded to metadata_wq at the
beginning of dm_integrity_map_continue, the workqueue metadata_wq
becomes clogged and the system is incapable of processing any metadata
updates.
This causes a deadlock because all the requests that need to do metadata
updates wait for metadata_wq to proceed and metadata_wq waits inside
wait_and_add_new_range until some existing request releases its range
lock (which doesn't happen because the range lock is released after
metadata update).
In order to fix the deadlock, we create a new workqueue offload_wq and
offload requests to it - so that processing of offload_wq is independent
from processing of metadata_wq.
Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d5bdf66108 upstream.
If we resume a device in bitmap mode and the on-disk format is in journal
mode, we must recalculate anything above ic->sb->recalc_sector. Otherwise,
there would be non-recalculated blocks which would cause I/O errors.
Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3918e0667b ]
[ 3934.173244] ======================================================
[ 3934.179572] WARNING: possible circular locking dependency detected
[ 3934.185884] 5.4.21-xfstests #1 Not tainted
[ 3934.190151] ------------------------------------------------------
[ 3934.196673] dmsetup/8897 is trying to acquire lock:
[ 3934.201688] ffffffffbce82b18 (shrinker_rwsem){++++}, at: unregister_shrinker+0x22/0x80
[ 3934.210268]
but task is already holding lock:
[ 3934.216489] ffff92a10cc5e1d0 (&pmd->root_lock){++++}, at: dm_pool_metadata_close+0xba/0x120
[ 3934.225083]
which lock already depends on the new lock.
[ 3934.564165] Chain exists of:
shrinker_rwsem --> &journal->j_checkpoint_mutex --> &pmd->root_lock
For a more detailed lockdep report, please see:
https://lore.kernel.org/r/20200220234519.GA620489@mit.edu
We shouldn't need to hold the lock while are just tearing down and
freeing the whole metadata pool structure.
Fixes: 44d8ebf436 ("dm thin metadata: use pool locking at end of dm_pool_metadata_close")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 29cda393bc ]
Patch "bcache: rework error unwinding in register_bcache" from
Christoph Hellwig changes the local variables 'path' and 'err'
in undefined initial state. If the code in register_bcache() jumps
to label 'out:' or 'out_module_put:' by goto, these two variables
might be reference with undefined value by the following line,
out_module_put:
module_put(THIS_MODULE);
out:
pr_info("error %s: %s", path, err);
return ret;
Therefore this patch initializes these two local variables properly
in register_bcache() to avoid such issue.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d1c3cc34f5 ]
Dan Carpenter points out that from commit 2aa8c52938 ("bcache: avoid
unnecessary btree nodes flushing in btree_flush_write()"), there is a
incorrect data type usage which leads to the following static checker
warning:
drivers/md/bcache/journal.c:444 btree_flush_write()
warn: 'ref_nr' unsigned <= 0
drivers/md/bcache/journal.c
422 static void btree_flush_write(struct cache_set *c)
423 {
424 struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
425 unsigned int i, nr, ref_nr;
^^^^^^
426 atomic_t *fifo_front_p, *now_fifo_front_p;
427 size_t mask;
428
429 if (c->journal.btree_flushing)
430 return;
431
432 spin_lock(&c->journal.flush_write_lock);
433 if (c->journal.btree_flushing) {
434 spin_unlock(&c->journal.flush_write_lock);
435 return;
436 }
437 c->journal.btree_flushing = true;
438 spin_unlock(&c->journal.flush_write_lock);
439
440 /* get the oldest journal entry and check its refcount */
441 spin_lock(&c->journal.lock);
442 fifo_front_p = &fifo_front(&c->journal.pin);
443 ref_nr = atomic_read(fifo_front_p);
444 if (ref_nr <= 0) {
^^^^^^^^^^^
Unsigned can't be less than zero.
445 /*
446 * do nothing if no btree node references
447 * the oldest journal entry
448 */
449 spin_unlock(&c->journal.lock);
450 goto out;
451 }
452 spin_unlock(&c->journal.lock);
As the warning information indicates, local varaible ref_nr in unsigned
int type is wrong, which does not matche atomic_read() and the "<= 0"
checking.
This patch fixes the above error by defining local variable ref_nr as
int type.
Fixes: 2aa8c52938 ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7c02b0055f ]
In bset.h, macro bset_bkey_last() is defined as,
bkey_idx((struct bkey *) (i)->d, (i)->keys)
Parameter i can be variable type of data structure, the macro always
works once the type of struct i has member 'd' and 'keys'.
bset_bkey_last() is also used in macro csum_set() to calculate the
checksum of a on-disk data structure. When csum_set() is used to
calculate checksum of on-disk bcache super block, the parameter 'i'
data type is struct cache_sb_disk. Inside struct cache_sb_disk (also in
struct cache_sb) the member keys is __u16 type. But bkey_idx() expects
unsigned int (a 32bit width), so there is problem when sending
parameters via stack to call bkey_idx().
Sparse tool from Intel 0day kbuild system reports this incompatible
problem. bkey_idx() is part of user space API, so the simplest fix is
to cast the (i)->keys to unsigned int type in macro bset_bkey_last().
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ae3cd29991 ]
The patch "bcache: rework error unwinding in register_bcache" introduces
a use-after-free regression in register_bcache(). Here are current code,
2510 out_free_path:
2511 kfree(path);
2512 out_module_put:
2513 module_put(THIS_MODULE);
2514 out:
2515 pr_info("error %s: %s", path, err);
2516 return ret;
If some error happens and the above code path is executed, at line 2511
path is released, but referenced at line 2515. Then KASAN reports a use-
after-free error message.
This patch changes line 2515 in the following way to fix the problem,
2515 pr_info("error %s: %s", path?path:"", err);
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 50246693f8 ]
Split the successful and error return path, and use one goto label for each
resource to unwind. This also fixes some small errors like leaking the
module reference count in the reboot case (which seems entirely harmless)
or printing the wrong warning messages for early failures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e8547d4209 ]
Same as cache device, the buffer page needs to be put while
freeing cached_dev. Otherwise a page would be leaked every
time a cached_dev is stopped.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 873937e75f ]
The existing code allows changing the data device when the thin-pool
target is reloaded.
This capability is not required and only complicates device lifetime
guarantees. This can cause crashes like the one reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1788596
where the kernel tries to issue a flush bio located in a structure that
was already freed.
Take the first step to simplifying the thin-pool's data device lifetime
by disallowing changing it. Like the thin-pool's metadata device, the
data device is now set in pool_create() and it cannot be changed for a
given thin-pool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2aa8c52938 upstream.
the commit 91be66e131 ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
of clean btree nodes.
- Take c->btree_cache as a LRU-like list, aggressively flushing all
dirty nodes from tail of c->btree_cache util the btree node with
oldest journal entry is flushed. This is to reduce the time of holding
c->bucket_lock.
Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying the above patch. Guoju Fang
provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c->btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c->btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.
After spending a lot of time on debug and diagnose, I find the real
condition is more complicated, aggressive flushing dirty btree nodes
from tail of c->btree_cache list is not a good solution.
- When all btree nodes are cached in memory, c->btree_cache is not
a LRU-like list, the btree nodes with oldest journal entry won't
be close to the tail of the list.
- There can be hundreds dirty btree nodes reference the oldest journal
entry, before flushing all the nodes the oldest journal entry cannot
be reclaimed.
When the above two conditions mixed together, a simply flushing from
tail of c->btree_cache list is really NOT a good idea.
Fortunately there is still chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree nodes flushing,
- Only acquire c->journal.lock when getting oldest journal entry of
fifo c->journal.pin. In rested locations check the journal entries
locklessly, so their values can be changed on other cores
in parallel.
- In loop list_for_each_entry_safe_reverse(), checking latest front
point of fifo c->journal.pin. If it is different from the original
point which we get with locking c->journal.lock, it means the oldest
journal entry is reclaim on other cores. At this moment, all selected
dirty nodes recorded in array btree_nodes[] are all flushed and clean
on other CPU cores, it is unncessary to iterate c->btree_cache any
longer. Just quit the list_for_each_entry_safe_reverse() loop and
the following for-loop will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
loop. Check the refcount value of orignial fifo front point, if the
value is larger than selected node number of btree_nodes[], it means
more matching btree nodes should be scanned. Otherwise it means no
more matching btee nodes in rest of c->btree_cache list, the loop
can be quit. If the original oldest journal entry is reclaimed and
fifo front point is updated, the refcount of original fifo front point
will be 0, then the loop will be quit too.
- Not hold c->bucket_lock too long time. c->bucket_lock is also required
for space allocation for cached data, hold it for too long time will
block regular I/O requests. When iterating list c->btree_cache, even
there are a lot of maching btree nodes, in order to not holding
c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are
selected and to flush in following for-loop.
With this patch, only btree nodes referencing oldest journal entry
are flushed to cache device, no aggressive flushing for unnecessary
btree node any more. And in order to avoid blocking regluar I/O
requests, each time when btree_flush_write() called, at most only
BTREE_FLUSH_NR btree nodes are selected to flush, even there are more
maching btree nodes in list c->btree_cache.
At last, one more thing to explain: Why it is safe to read front point
of c->journal.pin without holding c->journal.lock inside the
list_for_each_entry_safe_reverse() loop ?
Here is my answer: When reading the front point of fifo c->journal.pin,
we don't need to know the exact value of front point, we just want to
check whether the value is different from the original front point
(which is accurate value because we get it while c->jouranl.lock is
held). For such purpose, it works as expected without holding
c->journal.lock. Even the front point is changed on other CPU core and
not updated to local core, and current iterating btree node has
identical journal entry local as original fetched fifo front point, it
is still safe. Because after holding mutex b->write_lock (with memory
barrier) this btree node can be found as clean and skipped, the loop
will quite latter when iterate on next node of list c->btree_cache.
Fixes: 91be66e131 ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang <fangguoju@gmail.com>
Reported-by: Shuang Li <psymon@bonuscloud.io>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 038ba8cc1b upstream.
In year 2007 high performance SSD was still expensive, in order to
save more space for real workload or meta data, the readahead I/Os
for non-meta data was bypassed and not cached on SSD.
In now days, SSD price drops a lot and people can find larger size
SSD with more comfortable price. It is unncessary to alway bypass
normal readahead I/Os to save SSD space for now.
This patch adds options for readahead data cache policies via sysfs
file /sys/block/bcache<N>/readahead_cache_policy, the options are,
- "all": cache all readahead data I/Os.
- "meta-only": only cache meta data, and bypass other regular I/Os.
If users want to make bcache continue to only cache readahead request
for metadata and bypass regular data readahead, please set "meta-only"
to this sysfs file. By default, bcache will back to cache all read-
ahead requests now.
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 47ace7e012 upstream.
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).
Otherwise generic_make_request() will crash simply by doing:
dmsetup create -n test
mount /dev/dm-N /mnt
While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.
Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab3458 ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c38 ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 44d8ebf436 upstream.
Ensure that the pool is locked during calls to __commit_transaction and
__destroy_persistent_data_objects. Just being consistent with locking,
but reality is dm_pool_metadata_close is called once pool is being
destroyed so access to pool shouldn't be contended.
Also, use pmd_write_lock_in_core rather than __pmd_write_lock in
dm_pool_commit_metadata and rename __pmd_write_lock to
pmd_write_lock_in_core -- there was no need for the alias.
In addition, verify that the pool is locked in __commit_transaction().
Fixes: 873f258bec ("dm thin metadata: do not write metadata if no changes occurred")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9402e95901 upstream.
GFP_KERNEL is not supposed to be or'd with GFP_NOFS (the result is
equivalent to GFP_KERNEL). Also, we use GFP_NOIO instead of GFP_NOFS
because we don't want any I/O being submitted in the direct reclaim
path.
Fixes: 39d13a1ac4 ("dm crypt: reuse eboiv skcipher for IV generation")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aa9509209c upstream.
When committing state, the function writecache_flush does the following:
1. write metadata (writecache_commit_flushed)
2. flush disk cache (writecache_commit_flushed)
3. wait for data writes to complete (writecache_wait_for_ios)
4. increase superblock seq_count
5. write the superblock
6. flush disk cache
It may happen that at step 3, when we wait for some write to finish, the
disk may report the write as finished, but the write only hit the disk
cache and it is not yet stored in persistent storage. At step 5 we write
the superblock - it may happen that the superblock is written before the
write that we waited for in step 3. If the machine crashes, it may result
in incorrect data being returned after reboot.
In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
so that we first wait for writes to complete and then flush the disk
cache.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4feaef830d upstream.
The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets. There are variants for
tracking metadata blocks and data blocks.
Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.
When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was. This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).
This fix keeps the 'begin' field, but now it's just a hint to speed up
the search. Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free. Much simpler.
This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.
Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b399629503 upstream.
dm-zoned is observed to log failed kernel assertions and not work
correctly when operating against a device with a zone size smaller
than 128MiB (e.g. 32768 bits per 4K block). The reason is that the
bitmap size per zone is calculated as zero with such a small zone
size. Fix this problem and also make the code related to zone bitmap
management be able to handle per zone bitmaps smaller than a single
block.
A dm-zoned-tools patch is required to properly format dm-zoned devices
with zone sizes smaller than 128MiB.
Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a4a8d28658 ]
dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.
Since commit 694cfe7f31 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd->pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd->pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.
Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.
In should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.
Fixes: 694cfe7f31 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ad6bf88a6c upstream.
Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.
For exmaple (run this on an architecture with 64k pages):
Mount will fail with this error because it tries to read the superblock using 2-sector
access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblock
This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 028288df63 ]
In raid1_sync_request func, rdev should be checked before reference.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a7ede3d168 ]
With commit 6ce220dd2f ("raid5: don't set
STRIPE_HANDLE to stripe which is in batch list"), we don't want to set
STRIPE_HANDLE flag for sh which is already in batch list.
However, the stripe which is the head of batch list should set this flag,
otherwise panic could happen inside init_stripe at BUG_ON(sh->batch_head),
it is reproducible with raid5 on top of nvdimm devices per Xiao oberserved.
Thanks for Xiao's effort to verify the change.
Fixes: 6ce220dd2f ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list")
Reported-by: Xiao Ni <xni@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3b7436cc94 ]
For super_90_load, we need to make sure 'desc_nr' less
than MD_SB_DISKS, avoiding invalid memory access of 'sb->disks'.
Fixes: 228fc7d76d ("md: avoid invalid memory access for array sb->dev_roles")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9fcc34b1a6 ]
In bch_mca_scan(), the number of shrinking btree node is calculated
by code like this,
unsigned long nr = sc->nr_to_scan;
nr /= c->btree_pages;
nr = min_t(unsigned long, nr, mca_can_free(c));
variable sc->nr_to_scan is number of objects (here is bcache B+tree
nodes' number) to shrink, and pointer variable sc is sent from memory
management code as parametr of a callback.
If sc->nr_to_scan is smaller than c->btree_pages, after the above
calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
and releasing mutex c->bucket_lock.
This patch checkes whether nr is 0 after the above calculation, if 0
is the result then set 1 to variable 'n'. Then at least bch_mca_scan()
will try to shrink a single B+tree node.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 228fc7d76d ]
we need to gurantee 'desc_nr' valid before access array
of sb->dev_roles.
In addition, we should avoid .load_super always return '0'
when level is LEVEL_MULTIPATH, which is not expected.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses")
Fixes: 6a5cb53aaa ("md: no longer compare spare disk superblock events in super_load")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 84c529aea1 ]
bcache_allocator can call the following:
bch_allocator_thread()
-> bch_prio_write()
-> bch_bucket_alloc()
-> wait on &ca->set->bucket_wait
But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:
[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
[ 1158.495929] Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1158.504413] bcache_allocato D 0 15861 2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429] __schedule+0x2a8/0x670
[ 1158.504432] schedule+0x2d/0x90
[ 1158.504448] bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453] ? wait_woken+0x80/0x80
[ 1158.504466] bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476] bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491] kthread+0x121/0x140
[ 1158.504503] ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506] ? kthread_park+0xb0/0xb0
[ 1158.504510] ret_from_fork+0x35/0x40
Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.
Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.
BugLink: https://bugs.launchpad.net/bugs/1784665
BugLink: https://bugs.launchpad.net/bugs/1796292
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>