linux

Eric Ren e7ee2c089e ocfs2: fix crash caused by stale lvb with fsdlm plugin

The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.

The crash backtrace is below:

  dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
  dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
  ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
  ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
  ocfs2: Beginning quota recovery on device (253,18) for slot 2
  ocfs2: Finishing quota recovery on device (253,18) for slot 2
  (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
  (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
  ------------[ cut here ]------------
  kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
  Supported: No, Unsupported modules are loaded
  CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
  task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
  RIP: 0010:[<ffffffffa05c8c30>]
  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
  RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
  RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
  RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
  R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
  R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
  FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
  Call Trace:
    ocfs2_setattr+0x698/0xa90 [ocfs2]
    notify_change+0x1ae/0x380
    do_truncate+0x5e/0x90
    do_sys_ftruncate.constprop.11+0x108/0x160
    entry_SYSCALL_64_fastpath+0x12/0x6d
  Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
  RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]

It's because ocfs2_inode_lock() gets us a stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?

The current code tries to downconvert the lock without the DLM_LKF_VALBLK
flag, to tell o2cb not to update the RSB's LVB if it's a PR->NULL
conversion, even if the lock resource type needs LVB. This is not the
right way for fsdlm.

The fsdlm plugin behaves differently on DLM_LKF_VALBLK: it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering the RSB's LVB from
this lkb and skip setting DLM_SBF_VALNOTVALID appropriately when a node
failure happens.

The following diagram briefly illustrates how this crash happens:

RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;

The 1st round:

             Node1                                    Node2
  RSB1: PR
                                                  RSB1(master): NULL->EX
  ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
    ocfs2_dlm_lock(no DLM_LKF_VALBLK)

  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  dlm_lock(no DLM_LKF_VALBLK)
    convert_lock(overwrite lkb->lkb_exflags
                 with no DLM_LKF_VALBLK)

  RSB1: NULL                                      RSB1: EX
                                                  reset Node2
  dlm_recover_rsbs()
    recover_lvb()

  /* The LVB is not trustable if the node with EX fails and
   * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
   */

   if(!(lkb_exflags & DLM_LKF_VALBLK))  /* This means we miss the chance
           return;                       * to invalidate the LVB here.
                                         */

The 2nd round:

             Node 1                                   Node2
  RSB1(become master from recovery)

  ocfs2_setattr()
    ocfs2_inode_lock(NULL->EX)
      /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
      ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
    ocfs2_truncate_file()
      mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */

The fix is quite straightforward. We keep setting the DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is used.

Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2017-01-10 18:31:54 -08:00
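
The flag decision described in the commit message can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions, not the kernel patch: every name prefixed with sketch_ or SKETCH_ is hypothetical, the flag values are illustrative rather than the kernel's, and the stack type is reduced to a two-value enum. It shows the one rule the message states: on a downconvert, keep the value-block flag whenever the lock type uses an LVB and the fsdlm plugin is in use, even if the caller did not ask to write the LVB.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for DLM_LKF_VALBLK and LOCK_TYPE_USES_LVB.
 * The numeric values are illustrative only, not the kernel's. */
#define SKETCH_LKF_VALBLK          0x1u
#define SKETCH_LOCK_TYPE_USES_LVB  0x1u

/* Hypothetical: which cluster stack backs the filesystem. */
enum sketch_stack {
	SKETCH_STACK_O2CB,
	SKETCH_STACK_FSDLM,
};

/*
 * Compute the DLM flags for a downconvert request.
 *
 * set_lvb says whether the caller wants the LVB written back
 * (0 for the PR->NULL case in the diagram). With o2cb, leaving the
 * flag off just means "do not update the RSB's LVB". With fsdlm, the
 * flag also decides whether recovery looks at this lock's LVB at all,
 * so it is kept for every lock type that uses an LVB.
 */
static unsigned int sketch_downconvert_flags(enum sketch_stack stack,
					     unsigned int lock_type_flags,
					     bool set_lvb)
{
	unsigned int dlm_flags = 0;

	if (stack == SKETCH_STACK_FSDLM &&
	    (lock_type_flags & SKETCH_LOCK_TYPE_USES_LVB))
		set_lvb = true;

	if (set_lvb)
		dlm_flags |= SKETCH_LKF_VALBLK;

	return dlm_flags;
}

/* Small self-check covering the PR->NULL case from the diagram. */
static void sketch_self_check(void)
{
	/* fsdlm keeps the flag even though set_lvb is false. */
	assert(sketch_downconvert_flags(SKETCH_STACK_FSDLM,
					SKETCH_LOCK_TYPE_USES_LVB,
					false) == SKETCH_LKF_VALBLK);
	/* o2cb behaviour is unchanged: no flag when set_lvb is false. */
	assert(sketch_downconvert_flags(SKETCH_STACK_O2CB,
					SKETCH_LOCK_TYPE_USES_LVB,
					false) == 0);
}
```

In this model, lock types without an LVB are unaffected on either stack; only the fsdlm case with an LVB-using lock changes.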
..
cluster	ktime: Get rid of the union	2016-12-25 17:21:22 +01:00
dlm	ocfs2/dlm: clean up deadcode in dlm_master_request_handler()	2016-12-12 18:55:06 -08:00
dlmfs	Replace <asm/uaccess.h> with <linux/uaccess.h> globally	2016-12-24 11:46:01 -08:00
Kconfig	…
Makefile	ocfs2: disable BUG assertions in reading blocks	2016-06-24 17:23:52 -07:00
acl.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-10-10 20:16:43 -07:00
acl.h	ocfs2: fix posix_acl_create deadlock	2016-05-12 15:52:50 -07:00
alloc.c	ocfs2: add newlines to some error messages	2016-12-10 12:39:45 -08:00
alloc.h	ocfs2: retry on ENOSPC if sufficient space in truncate log	2016-08-02 17:31:41 -04:00
aops.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-12-17 18:44:00 -08:00
aops.h	ocfs2: clean up unused 'page' parameter in ocfs2_write_end_nolock()	2016-12-12 18:55:06 -08:00
blockcheck.c	…
blockcheck.h	…
buffer_head_io.c	block,fs: untangle fs.h and blk_types.h	2016-11-01 09:43:26 -06:00
buffer_head_io.h	…
dcache.c	VFS: normal filesystems (and lustre): d_inode() annotations	2015-04-15 15:06:57 -04:00
dcache.h	ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock	2014-04-03 16:20:55 -07:00
dir.c	ocfs2: fix not enough credit panic	2016-11-11 08:12:37 -08:00
dir.h	VFS: normal filesystems (and lustre): d_inode() annotations	2015-04-15 15:06:57 -04:00
dlmglue.c	ocfs2: fix crash caused by stale lvb with fsdlm plugin	2017-01-10 18:31:54 -08:00
dlmglue.h	ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread	2014-04-03 16:20:55 -07:00
export.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2015-04-26 17:22:07 -07:00
export.h	…
extent_map.c	ocfs2: neaten do_error, ocfs2_error and ocfs2_abort	2015-09-04 16:54:41 -07:00
extent_map.h	…
file.c	ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features	2016-12-10 12:39:45 -08:00
file.h	ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features	2016-12-10 12:39:45 -08:00
filecheck.c	ocfs2: sysfile interfaces for online file check	2016-03-22 15:36:02 -07:00
filecheck.h	ocfs2: sysfile interfaces for online file check	2016-03-22 15:36:02 -07:00
heartbeat.c	…
heartbeat.h	…
inode.c	ocfs2: replace CURRENT_TIME macro	2016-12-12 18:55:06 -08:00
inode.h	ocfs2: convert inode refcount test to a helper	2016-12-10 12:39:45 -08:00
ioctl.c	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
ioctl.h	…
journal.c	ocfs2: use time64_t to represent orphan scan times	2016-12-12 18:55:06 -08:00
journal.h	jbd2: add support for avoiding data writes during transaction commits	2016-04-24 00:56:07 -04:00
localalloc.c	ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local	2016-03-25 16:37:42 -07:00
localalloc.h	ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters	2014-02-06 13:48:51 -08:00
locks.c	ocfs2: fix flock panic issue	2015-12-29 17:45:49 -08:00
locks.h	…
mmap.c	ocfs2: clean up unused 'page' parameter in ocfs2_write_end_nolock()	2016-12-12 18:55:06 -08:00
mmap.h	…
move_extents.c	ocfs2: convert inode refcount test to a helper	2016-12-10 12:39:45 -08:00
move_extents.h	…
namei.c	ocfs2: replace CURRENT_TIME macro	2016-12-12 18:55:06 -08:00
namei.h	ocfs2: do not include dio entry in case of orphan scan	2015-11-05 19:34:48 -08:00
ocfs1_fs_compat.h	…
ocfs2.h	ocfs2: use time64_t to represent orphan scan times	2016-12-12 18:55:06 -08:00
ocfs2_fs.h	ocfs2: fix comment in struct ocfs2_extended_slot	2016-05-19 19:12:14 -07:00
ocfs2_ioctl.h	…
ocfs2_lockid.h	…
ocfs2_lockingver.h	…
ocfs2_trace.h	switch generic_file_splice_read() to use of ->read_iter()	2016-10-05 18:23:56 -04:00
quota.h	quota: constify qtree_fmt_operations structures	2016-01-04 10:58:35 +01:00
quota_global.c	ocfs2: Protect periodic quota syncing with s_umount semaphore	2016-11-30 08:36:54 +01:00
quota_local.c	ocfs2: Use s_umount for quota recovery protection	2016-11-30 08:37:21 +01:00
refcounttree.c	vfs: fix isize/pos/len checks for reflink & dedupe	2016-12-22 23:00:23 -05:00
refcounttree.h	ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features	2016-12-10 12:39:45 -08:00
reservations.c	ocfs2: make resv_lock spinlock static	2015-02-10 14:30:29 -08:00
reservations.h	…
resize.c	ocfs2: solve a problem of crossing the boundary in updating backups	2016-03-25 16:37:42 -07:00
resize.h	…
slot_map.c	ocfs2: clean up an unneeded goto in ocfs2_put_slot()	2016-05-19 19:12:14 -07:00
slot_map.h	…
stack_o2cb.c	ocfs2: avoid a pointless delay in o2cb_cluster_check()	2015-04-14 16:48:57 -07:00
stack_user.c	Replace <asm/uaccess.h> with <linux/uaccess.h> globally	2016-12-24 11:46:01 -08:00
stackglue.c	ocfs2: fix crash caused by stale lvb with fsdlm plugin	2017-01-10 18:31:54 -08:00
stackglue.h	ocfs2: fix crash caused by stale lvb with fsdlm plugin	2017-01-10 18:31:54 -08:00
suballoc.c	ocfs2: fix double unlock in case retry after free truncate log	2016-09-19 15:36:17 -07:00
suballoc.h	ocfs2: rollback alloc_dinode counts when ocfs2_block_group_set_bits() failed	2014-04-03 16:20:56 -07:00
super.c	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs	2016-12-19 08:23:53 -08:00
super.h	ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local	2016-03-25 16:37:42 -07:00
symlink.c	vfs: remove ".readlink = generic_readlink" assignments	2016-12-09 16:45:04 +01:00
symlink.h	…
sysfile.c	ocfs2: avoid system inode ref confusion by adding mutex lock	2014-04-03 16:20:57 -07:00
sysfile.h	…
uptodate.c	ocfs2: remove NULL assignments on static	2014-06-04 16:53:53 -07:00
uptodate.h	…
xattr.c	ocfs2: convert inode refcount test to a helper	2016-12-10 12:39:45 -08:00
xattr.h	ocfs2: fix posix_acl_create deadlock	2016-05-12 15:52:50 -07:00