The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.
The crash backtrace is below:
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
The 1st round:
Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When there are errors in the ocfs2 filesystem, they are usually
accompanied by the inode number which caused the error. This inode
number would be the input to fixing the file. One of these options
could be considered:
A file in the sys filesytem which would accept inode numbers. This
could be used to communication back what has to be fixed or is fixed.
You could write:
$# echo "<inode>" > /sys/fs/ocfs2/devname/filecheck/check
or
$# echo "<inode>" > /sys/fs/ocfs2/devname/filecheck/fix
Compare with second version, I re-design filecheck sysfs interfaces,
there are three sysfs files (check, fix and set) under filecheck
directory (see above), sysfs will accept only one argument <inode>.
Second, I adjust some code in ocfs2_filecheck_repair_inode_block()
function according to upstream feedback, we cannot just add VALID_FL
flag back as a inode block fix, then we will not fix this field
corruption currently until having a complete solution. Compare with
first version, I use strncasecmp instead of double strncmp functions.
Second, update the source file contribution vendor.
This patch (of 4):
Export ocfs2_kset object from ocfs2_stackglue kernel module, then online
file check code will create the related sysfiles under ocfs2_kset
object. We're exporting this because it's built in ocfs2_stackglue.ko.
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is done to differentiate between using and not using controld and
use the connection information accordingly.
We need to be backward compatible. So, we use a new enum
ocfs2_connection_type to identify when controld is used and when it is
not.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.
fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.
The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer
For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code. This was tested by using the kernel with changes on older
unmodified tools. The kernel used ocfs2_controld as expected, and
displayed the appropriate warning message.
This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.
This patch (of 6):
Add clustername to cluster connection.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unlike ocfs2, dlmfs has no permanent storage. It can't store off a
cluster stack it is supposed to be using. So it can't specify the stack
name in ocfs2_cluster_connect().
Instead, we create ocfs2_cluster_connect_agnostic(), which simply uses
the stack that is currently enabled. This is find for dlmfs, which will
rely on the stack initialization.
We add the "stackglue" capability to dlmfs's capability list. This lets
userspace know dlmfs can be used with all cluster stacks.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Inside the stackglue, the locking protocol structure is hanging off of
the ocfs2_cluster_connection. This takes it one further; the locking
protocol is passed into ocfs2_cluster_connect(). Now different cluster
connections can have different locking protocols with distinct asts.
Note that all locking protocols have to keep their maximum protocol
version in lock-step.
With the protocol structure set in ocfs2_cluster_connect(), there is no
need for the stackglue to have a static pointer to a specific protocol
structure. We can change initialization to only pass in the maximum
protocol version.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
With the full ocfs2_locking_protocol hanging off of the
ocfs2_cluster_connection, ast wrappers can get the ast/bast pointers
there. They don't need to get them from their plugin structure.
The user plugin still needs the maximum locking protocol version,
though. This changes the plugin structure so that it only holds the max
version, not the entire ocfs2_locking_protocol pointer.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
With the ocfs2_cluster_connection hanging off of the ocfs2_dlm_lksb, we
have access to it in the ast and bast wrapper functions. Attach the
ocfs2_locking_protocol to the conn.
Now, instead of refering to a static variable for ast/bast pointers, the
wrappers can look at the connection. This means different connections
can have different ast/bast pointers, and it reduces the need for the
static pointer.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We're going to want it in the ast functions, so we convert union
ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The stackglue ast and bast functions tried to maintain the fiction that
their arguments were void pointers. In reality, stack_user.c had to
know that the argument was an ocfs2_lock_res in order to get the status
off of the lksb. That's ugly.
This changes stackglue to always pass the lksb as the argument to ast
and bast functions. The caller can always use container_of() to get the
ocfs2_lock_res or user_dlm_lock_res. The net effect to the caller is
zero. They still get back the lockres in their ast. stackglue gets
cleaner, and now can use the lksb itself.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The Lock Value Block (LVB) of a DLM lock can be lost when nodes die and
the DLM cannot reconstruct its state. Clients of the DLM need to know
this.
ocfs2's internal DLM, o2dlm, explicitly zeroes out the LVB when it loses
track of the state. This is not a standard behavior, but ocfs2 has
always relied on it. Thus, an o2dlm LVB is always "valid".
ocfs2 now supports both o2dlm and fs/dlm via the stack glue. When
fs/dlm loses track of an LVBs state, it sets a flag
(DLM_SBF_VALNOTVALID) on the Lock Status Block (LKSB). The contents of
the LVB may be garbage or merely stale.
ocfs2 doesn't want to try to guess at the validity of the stale LVB.
Instead, it should be checking the VALNOTVALID flag. As this is the
'standard' way of treating LVBs, we will promote this behavior.
We add a stack glue API ocfs2_dlm_lvb_valid(). It returns non-zero when
the LVB is valid. o2dlm will always return valid, while fs/dlm will
check VALNOTVALID.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
This is actually pretty easy since fs/dlm already handles the bulk of the
work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the
underlying lock manager, so I only had to add the right calls.
Cluster-aware POSIX locks ("plocks") can be turned off by the same means at
UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume.
Internally, the file system uses two sets of file_operations, depending on
whether cluster aware plocks is required. This turns out to be easier than
implementing local-only versions of ->lock.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ->hangup() call was only used to execute ocfs2_hb_ctl. Now that
the generic stack glue code handles this, the underlying stack drivers
don't need to know about it.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Take o2hb_stop() out of the o2cb code and make it part of the generic
stack glue as ocfs2_leave_group(). This also allows us to remove the
ocfs2_get_hb_ctl_path() function - everything to do with hb_ctl is now
part of stackglue.c. o2cb no longer needs a ->hangup() function.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 needs to call out to the hb_ctl program at unmount for all cluster
stacks. The first step is to move the hb_ctl_path sysctl out of the
o2cb code and into the generic stack glue.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add code to use fs/dlm.
[ Modified to be part of the stack_user module -- Joel ]
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Userspace can now query and specify the cluster stack in use via the
/sys/fs/ocfs2/cluster_stack file. By default, it is 'o2cb', which is
the classic stack. Thus, old tools that do not know how to modify this
file will work just fine. The stack cannot be modified if there is a
live filesystem.
ocfs2_cluster_connect() now takes the expected cluster stack as an
argument. This way, the filesystem and the stack glue ensure they are
speaking to the same backend.
If the stack is 'o2cb', the o2cb stack plugin is used. For any other
value, the fsdlm stack plugin is selected.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c. This becomes the
ocfs2_stack_o2cb.ko module.
The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module. This module now provides an interface to
register drivers. The ocfs2_stack_o2cb driver registers itself. As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().
ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Define the ocfs2_stack_operations structure. Build o2cb_stack_ops from
all of the o2cb-specific stack functions. Change the generic stack glue
functions to call the stack_ops instead of the o2cb functions directly.
The o2cb functions are moved to stack_o2cb.c. The headers are cleaned up
to where only needed headers are included.
In this code, stackglue.c and stack_o2cb.c refer to some shared
extern variables. When they become modules, that will change.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dlmglue.c was still referencing a raw o2dlm lksb in one instance. Let's
create a generic ocfs2_dlm_dump_lksb() function. This allows underlying
DLMs to print whatever they want about their lock.
We then move the o2dlm dump into stackglue.c where it belongs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.
We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl. Other stacks will just leave hangup() empty.
The check for heartbeat is moved into ocfs2_cluster_connect(). It will
be matched by a similar check for other stacks.
With this change, only stackglue.c includes cluster/ headers.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 asks the cluster stack for the local node's node number for two
reasons; to fill the slot map and to print it. While the slot map isn't
necessary for userspace cluster stacks, the printing is very nice for
debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get
this value. It is anticipated that the slot map will not be used under a
userspace cluster stack, so validity checks of the node num only need to
exist in the slot map code. Otherwise, it just gets used and printed as an
opaque value.
[ Fixed up some "int" versus "unsigned int" issues and made osb->node_num
truly opaque. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This step introduces a cluster stack agnostic API for initializing and
exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack. It is all handled in stackglue.c.
heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.
The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
o2dlm initialization in one block. Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain. Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call. I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect(). The filesystem has already set
itself to ignore the callback.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Wrap the lock status block (lksb) in a union. Later we will add a union
element for the fs/dlm lksb. Create accessors for the status and lvb
fields.
Other than a debugging function, dlmglue.c does not directly reference
the o2dlm locking path anymore.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Change the ocfs2_dlm_lock/unlock() functions to return -errno values.
This is the first step towards elminiating dlm_status in
fs/ocfs2/dlmglue.c. The change also passes -errno values to
->unlock_ast().
[ Fix a return code in dlmglue.c and change the error translation table into
an array of ints. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2 generic code should use the values in <linux/dlmconstants.h>.
stackglue.c will convert them to o2dlm values.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>