linux

Author	SHA1	Message	Date
Borislav Petkov	b11f2d6b80	EDAC/ghes: Check whether the driver is on the safe list correctly [ Upstream commit `251c54ea26` ] With CONFIG_DEBUG_TEST_DRIVER_REMOVE=y, a system would try to probe, unregister and probe again a driver. When ghes_edac is attempted to be loaded on a system which is not on the safe platforms list, ghes_edac_register() would return early. The unregister counterpart ghes_edac_unregister() would still attempt to unregister and exit early at the refcount test, leading to the refcount underflow below. In order to not do anything on the unregister path too, reuse the force_load parameter and check it on that path too, before fumbling with the refcount. ghes_edac: ghes_edac_register: entry ghes_edac: ghes_edac_register: return -ENODEV ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 10 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xb9/0x100 Modules linked in: CPU: 10 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #12 Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018 RIP: 0010:refcount_warn_saturate+0xb9/0x100 Code: 82 e8 fb 8f 4d 00 90 0f 0b 90 90 c3 80 3d 55 4c f5 00 00 75 88 c6 05 4c 4c f5 00 01 90 48 c7 c7 d0 8a 10 82 e8 d8 8f 4d 00 90 <0f> 0b 90 90 c3 80 3d 30 4c f5 00 00 0f 85 61 ff ff ff c6 05 23 4c RSP: 0018:ffffc90000037d58 EFLAGS: 00010292 RAX: 0000000000000026 RBX: ffff88840b8da000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff8216b24f RDI: 00000000ffffffff RBP: ffff88840c662e00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88840ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000800002211000 CR4: 00000000003506e0 Call Trace: ghes_edac_unregister ghes_remove platform_drv_remove really_probe driver_probe_device device_driver_attach __driver_attach ? device_driver_attach ? device_driver_attach bus_for_each_dev bus_add_driver driver_register ? bert_init ghes_init do_one_initcall ? rcu_read_lock_sched_held kernel_init_freeable ? rest_init kernel_init ret_from_fork ... ghes_edac: ghes_edac_unregister: FALSE, refcount: -1073741824 Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200911164950.GB19320@zn.tnic Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-10-01 13:18:14 +02:00
Robert Richter	f90edcff1e	EDAC/ghes: Fix grain calculation [ Upstream commit `7088e29e04` ] The current code to convert a physical address mask to a grain (defined as granularity in bytes) is: e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK); This is broken in several ways: 1) It calculates to wrong grain values. E.g., a physical address mask of ~0xfff should give a grain of 0x1000. Without considering PAGE_MASK, there is an off-by-one. Things are worse when also filtering it with ~PAGE_MASK. This will calculate to a grain with the upper bits set. In the example it even calculates to ~0. 2) The grain does not depend on and is unrelated to the kernel's page-size. The page-size only matters when unmapping memory in memory_failure(). Smaller grains are wrongly rounded up to the page-size, on architectures with a configurable page-size (e.g. arm64) this could round up to the even bigger page-size of the hypervisor. Fix this with: e->grain = ~mem_err->physical_addr_mask + 1; The grain_bits are defined as: grain = 1 << grain_bits; Change also the grain_bits calculation accordingly, it is the same formula as in edac_mc.c now and the code can be unified. The value in ->physical_addr_mask coming from firmware is assumed to be contiguous, but this is not sanity-checked. However, in case the mask is non-contiguous, a conversion to grain_bits effectively converts the grain bit mask to a power of 2 by rounding it up. Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-11-rrichter@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-31 16:45:16 +01:00
Robert Richter	e240c7d1f1	EDAC/ghes: Do not warn when incrementing refcount on 0 [ Upstream commit `16214bd9e4` ] The following warning from the refcount framework is seen during ghes initialization: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT) ------------[ cut here ]------------ refcount_t: increment on 0; use-after-free. WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked [...] Call trace: refcount_inc_checked ghes_edac_register ghes_probe ... It warns if the refcount is incremented from zero. This warning is reasonable as a kernel object is typically created with a refcount of one and freed once the refcount is zero. Afterwards the object would be "used-after-free". For GHES, the refcount is initialized with zero, and that is why this message is seen when initializing the first instance. However, whenever the refcount is zero, the device will be allocated and registered. Since the ghes_reg_mutex protects the refcount and serializes allocation and freeing of ghes devices, a use-after-free cannot happen here. Instead of using refcount_inc() for the first instance, use refcount_set(). This can be used here because the refcount is zero at this point and can not change due to its protection by the mutex. Fixes: `23f61b9fc5` ("EDAC/ghes: Fix locking and memory barrier issues") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <huangming23@huawei.com> Cc: James Morse <james.morse@arm.com> Cc: <linuxarm@huawei.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: <tanxiaofei@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <wanghuiqiang@huawei.com> Link: https://lkml.kernel.org/r/20191121213628.21244-1-rrichter@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-17 19:56:54 +01:00
Robert Richter	d2136afb51	EDAC/ghes: Fix locking and memory barrier issues [ Upstream commit `23f61b9fc5` ] The ghes registration and refcount is broken in several ways: * ghes_edac_register() returns with success for a 2nd instance even if a first instance's registration is still running. This is not correct as the first instance may fail later. A subsequent registration may not finish before the first. Parallel registrations must be avoided. * The refcount was increased even if a registration failed. This leads to stale counters preventing the device from being released. * The ghes refcount may not be decremented properly on unregistration. Always decrement the refcount once ghes_edac_unregister() is called to keep the refcount sane. * The ghes_pvt pointer is handed to the irq handler before registration finished. * The mci structure could be freed while the irq handler is running. Fix this by adding a mutex to ghes_edac_register(). This mutex serializes instances to register and unregister. The refcount is only increased if the registration succeeded. This makes sure the refcount is in a consistent state after registering or unregistering a device. Note: A spinlock cannot be used here as the code section may sleep. The ghes_pvt is protected by ghes_lock now. This ensures the pointer is not updated before registration was finished or while the irq handler is running. It is unset before unregistering the device including necessary (implicit) memory barriers making the changes visible to other CPUs. Thus, the device can not be used anymore by an interrupt. Also, rename ghes_init to ghes_refcount for better readability and switch to refcount API. A refcount is needed because there can be multiple GHES structures being defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error Source, "Some platforms may describe multiple Generic Hardware Error Source structures with different notification types, ..."). Another approach to use the mci's device refcount (get_device()) and have a release function does not work here. A release function will be called only for device_release() with the last put_device() call. The device must be deleted before that with device_del(). This is only possible by maintaining an own refcount. [ bp: touchups. ] Fixes: `0fe5f281f7` ("EDAC, ghes: Model a single, logical memory controller") Fixes: `1e72e673b9` ("EDAC/ghes: Fix Use after free in ghes_edac remove path") Co-developed-by: James Morse <james.morse@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191105200732.3053-1-rrichter@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-13 08:43:30 +01:00
James Morse	1e72e673b9	EDAC/ghes: Fix Use after free in ghes_edac remove path ghes_edac models a single logical memory controller, and uses a global ghes_init variable to ensure only the first ghes_edac_register() will do anything. ghes_edac is registered the first time a GHES entry in the HEST is probed. There may be multiple entries, so subsequent attempts to register ghes_edac are silently ignored as the work has already been done. When a GHES entry is unregistered, it calls ghes_edac_unregister(), which free()s the memory behind the global variables in ghes_edac. But there may be multiple GHES entries, the next call to ghes_edac_unregister() will dereference the free()d memory, and attempt to free it a second time. This may also be triggered on a platform with one GHES entry, if the driver is unbound/re-bound and unbound. The re-bind step will do nothing because of ghes_init, the second unbind will then do the same work as the first. Doing the unregister work on the first call is unsafe, as another CPU may be processing a notification in ghes_edac_report_mem_error(), using the memory we are about to free. ghes_init is already half of the reference counting. We only need to do the register work for the first call, and the unregister work for the last. Add the unregister check. This means we no longer free ghes_edac's memory while there are GHES entries that may receive a notification. This was detected by KASAN and DEBUG_TEST_DRIVER_REMOVE. [ bp: merge into a single patch. ] Fixes: `0fe5f281f7` ("EDAC, ghes: Model a single, logical memory controller") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191014171919.85044-2-james.morse@arm.com Link: https://lkml.kernel.org/r/304df85b-8b56-b77e-1a11-aa23769f2e7c@huawei.com	2019-10-17 11:27:05 +02:00
Robert Richter	d55c79ac86	EDAC: Prefer 'unsigned int' to bare use of 'unsigned' Use of 'unsigned int' instead of bare use of 'unsigned'. Fix this for edac_mc*, ghes and the i5100 driver as reported by checkpatch.pl. While at it, struct member dev_ch_attribute->channel is always used as unsigned int. Change type to unsigned int to avoid type casts. [ bp: Massage. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190902123216.9809-2-rrichter@marvell.com	2019-09-03 19:21:19 +02:00
Thomas Gleixner	122375508b	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 172 Based on 1 normalized pattern(s): this file may be distributed under the terms of the gnu general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 9 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070034.395589349@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-30 11:26:39 -07:00
Fan Wu	c798c88f39	EDAC, ghes: Use CPER module handles to locate DIMMs Use SMBIOS module handle type 17, on platforms which provide valid ones, to locate the corresponding DIMM and thus have per-DIMM error counter updates. Signed-off-by: Fan Wu <wufan@codeaurora.org> [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tyler Baicar <baicar.tyler@gmail.com> Reviewed-by: James Morse <james.morse@arm.com> Tested-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: baicar.tyler@gmail.com Cc: john.garry@huawei.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-edac <linux-edac@vger.kernel.org> Cc: shiju.jose@huawei.com Cc: tanxiaofei@huawei.com Cc: wanghuiqiang@huawei.com Link: http://lkml.kernel.org/r/1537322340-1860-1-git-send-email-wufan@codeaurora.org	2018-09-22 18:35:40 +02:00
Borislav Petkov	eaa3a1d46c	EDAC, ghes: Make platform-based whitelisting x86-only ARM machines all have DMI tables so if they request hw error reporting through GHES, then the driver should be able to detect DIMMs and report errors successfully (famous last words :)). Make the platform-based list x86-specific so that ghes_edac can load on ARM. Reported-by: Qiang Zheng <zhengqiang10@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Tested-by: Qiang Zheng <zhengqiang10@huawei.com> Link: https://lkml.kernel.org/r/1526039543-180996-1-git-send-email-zhengqiang10@huawei.com	2018-05-21 12:18:57 +02:00
Borislav Petkov	a0671c39dd	EDAC, ghes: Use BIT() macro ... for improved readability. Also, add a local mask variable for the same reason. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>	2018-05-12 14:35:30 +02:00
Toshi Kani	ad0d73b324	EDAC, ghes: Add DDR4 and NVDIMM memory types The ghes_edac driver obtains memory type from SMBIOS type 17, but it does not recognize DDR4 and NVDIMM types. Add support of DDR4 and NVDIMM types. NVDIMM type is denoted by memory type DDR3/4 and non-volatile. Reported-by: Robert Elliott <elliott@hpe.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180509222030.9299-1-toshi.kani@hpe.com Signed-off-by: Borislav Petkov <bp@suse.de>	2018-05-12 14:16:49 +02:00
Alexandru Gagniuc	305d0e006a	EDAC, ghes: Remove unused argument to ghes_edac_report_mem_error() The use of the @ghes argument was removed in a previous commit, but function signature was not updated to reflect this. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-acpi@vger.kernel.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180430213358.8319-1-mr.nuke.me@gmail.com Signed-off-by: Borislav Petkov <bp@suse.de>	2018-05-12 10:59:15 +02:00
Sughosh Ganu	a66bdf5d11	EDAC, ghes: Add a null pointer check in ghes_edac_unregister() Add a null check for ghes_pvt, before dereferencing it. The pointer could still be null in case the return path is taken before initialising ghes_pvt in the registration function. Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Sughosh Ganu <sughosh.ganu@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lkml <linux-kernel@vger.kernel.org> Link: http://lkml.kernel.org/r/1524737809-24475-1-git-send-email-sughosh.ganu@arm.com Signed-off-by: Borislav Petkov <bp@suse.de>	2018-05-02 13:57:49 +02:00
Borislav Petkov	cc7f3f1326	ghes, EDAC: Fix ghes_edac registration Tony reported seeing "Internal error: Can't find EDAC structure" when injecting correctable errors due to the fact that ghes_edac would still load even if the whitelist won't hit. Drop the pr_err() in ghes_edac_report_mem_error() for now due to the hacky way how ghes_edac depends on ghes.c. While at it, make ghes_edac_register() return an error if it doesn't hit in the whitelist as it is the only sensible thing to do in that situation. Furthermore, move the call to it to happen last in ghes_probe() so that GHES initializing properly does not depend on ghes_edac init at all as latter is only reporting errors and not required for GHES's proper functioning. Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Tested-by: Sughosh Ganu <sughosh.ganu@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20180420182015.zao3olss4tvvlxki@agluck-desk	2018-05-02 13:57:30 +02:00
Toshi Kani	5deed6b6a4	EDAC, ghes: Add platform check The ghes_edac driver was introduced in 2013 [1], but it has not been enabled by any distro yet. This driver obtains error info from firmware interfaces (APEI), which are not properly implemented on many platforms, as the driver says on load: This EDAC driver relies on BIOS to enumerate memory and get error reports. Unfortunately, not all BIOSes reflect the memory layout correctly. So, the end result of using this driver varies from vendor to vendor. If you find incorrect reports, please contact your hardware vendor to correct its BIOS. To get out from this situation, add a platform check to selectively enable the driver on platforms that are known to have proper APEI firmware implementation. "ghes_edac.force_load=1" skips this platform check. [1]: https://lkml.kernel.org/r/cover.1360931635.git.mchehab@redhat.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-acpi@vger.kernel.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170823225447.15608-4-toshi.kani@hpe.com Signed-off-by: Borislav Petkov <bp@suse.de>	2017-09-25 12:55:12 +02:00
Borislav Petkov	0fe5f281f7	EDAC, ghes: Model a single, logical memory controller We're enumerating the DIMMs through a DMI walk and since we can't get any more detailed topological information about which DIMMs belong to which memory controller, convert it to a single, logical controller which contains all the DIMMs. The error reporting path from GHES ghes_edac_report_mem_error() doesn't get called in NMI context but add a warning about it to catch any changes in the future as if so, our locking scheme will be insufficient then. Signed-off-by: Borislav Petkov <bp@suse.de>	2017-09-25 12:35:28 +02:00
Borislav Petkov	c9c8b4d6d0	EDAC, ghes: Remove symbol exports They're called from builtin code so no need for the exports. Signed-off-by: Borislav Petkov <bp@suse.de>	2017-09-25 12:35:27 +02:00
Borislav Petkov	c54182ec0e	EDAC: Get rid of mci->mod_ver It is a write-only variable so get rid of it. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Robert Richter <rric@kernel.org> Acked-by: Michal Simek <michal.simek@xilinx.com> Acked-by: Thor Thayer <thor.thayer@linux.intel.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Mark Gross <mark.gross@intel.com> Cc: Tim Small <tim@buttersideup.com> Cc: Ranganathan Desikan <ravi@jetztechnologies.com> Cc: "Arvind R." <arvino55@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: "Sören Brinkmann" <soren.brinkmann@xilinx.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Daney <david.daney@cavium.com> Cc: Loc Ho <lho@apm.com> Cc: linux-edac@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@linux-mips.org	2017-07-17 13:42:48 +02:00
Mauro Carvalho Chehab	78d88e8a3d	edac: rename edac_core.h to edac_mc.h Now, all left at edac_core.h are at drivers/edac/edac_mc.c, so rename it to edac_mc.h. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>	2016-12-15 08:54:51 -02:00
Tan Xiaojun	990995bad1	EDAC: Fix PAGES_TO_MiB macro misuse The PAGES_TO_MiB macro is used for unit conversion but the trace_mc_event() tracepoint expects a page address. Fix that. Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1445341538-24271-1-git-send-email-tanxiaojun@huawei.com Signed-off-by: Borislav Petkov <bp@suse.de>	2015-10-22 22:57:30 +02:00
Aravind Gopalakrishnan	58a9c251c9	EDAC, ghes_edac: Remove redundant memory_type array We already have edac_mem_types[] that enumerates the different kinds of memory. So, use that and remove the redundant memory_type[] array here. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1442436811-23382-2-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2015-09-23 16:59:25 +02:00
Dan Carpenter	665aa8cdc4	ghes_edac: Use snprintf() to silence a static checker warning My static checker complains because the "e->location" has up to 256 characters but we are copying it into the "pvt->detail_location" which only has space for 240 characters. That's not counting the surrounding text and the "e->other_detail" string which can be over 80 characters long. I am not familiar with this code but presumably it normally works. Let's add a limit though for safety. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Link: http://lkml.kernel.org/r/20140801082514.GD28869@mwanda Signed-off-by: Borislav Petkov <bp@suse.de>	2014-11-11 18:08:56 +01:00
Mauro Carvalho Chehab	37e59f876b	[media, edac] Change my email address There are several left overs with my old email address. Remove their occurrences and add myself at CREDITS, to allow people to be able to reach me on my new addresses. Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>	2014-02-07 08:03:07 -02:00
Chen, Gong	56507694de	EDAC, GHES: Update ghes error record info In latest UEFI spec(by now it's 2.4) there are some new fields for memory error reporting. Add these new fields for ghes_edac interface. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2013-10-23 10:11:00 -07:00
Chen, Gong	147de14772	ACPI, APEI, CPER: Add UEFI 2.4 support for memory error In latest UEFI spec(by now it is 2.4) memory error definition for CPER (UEFI 2.4 Appendix N Common Platform Error Record) adds some new fields. These fields help people to locate memory error to an actual DIMM location. Original-author: Tony Luck <tony.luck@intel.com> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <m.chehab@samsung.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2013-10-23 10:10:20 -07:00
Wei Yongjun	5dae92a718	ghes_edac: fix to use list_for_each_entry_safe() when delete list items Since we will remove items off the list using list_del() we need to use a safe version of the list_for_each_entry() macro aptly named list_for_each_entry_safe(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-26 10:05:35 -03:00
Mauro Carvalho Chehab	8ae8f50ad8	ghes_edac: Fix RAS tracing With the current version of CPER, there's no way to associate an error with the memory error. So, the error location in EDAC layers is unused. As CPER has its own idea about memory architectural layers, just output whatever is there inside the driver's detail at the RAS tracepoint. The EDAC location keeps untouched, in the case that, in some future, we could actually map the error into the dimm labels. Now, the error message: [ 72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 [ 72.396627] {1}[Hardware Error]: APEI generic hardware error status [ 72.396628] {1}[Hardware Error]: severity: 2, corrected [ 72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected [ 72.396632] {1}[Hardware Error]: flags: 0x01 [ 72.396634] {1}[Hardware Error]: primary [ 72.396635] {1}[Hardware Error]: section_type: memory error [ 72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400 [ 72.396638] {1}[Hardware Error]: node: 3 [ 72.396639] {1}[Hardware Error]: card: 0 [ 72.396640] {1}[Hardware Error]: module: 0 [ 72.396641] {1}[Hardware Error]: device: 0 [ 72.396643] {1}[Hardware Error]: error_type: 18, unknown [ 72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory) Is properly represented on the trace event: kworker/0:2-584 [000] .... 72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location👎-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory) Tested on a 4 sockets E5-4650 Sandy Bridge machine. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-25 19:42:17 -03:00
Mauro Carvalho Chehab	689c9cd812	ghes_edac: Make it compliant with UEFI spec 2.3.1 The UEFI spec defines the memory error types ans the bits that validate each field on the memory error record, at Appendix N om items N.2.5 (Memory Error Section) and N.2.11 (Error Status). Make the error description compliant with it, only showing the valid fields. The EDAC error log is now properly reporting the error: [ 281.556854] mce: [Hardware Error]: Machine check events logged [ 281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 [ 281.557044] {2}[Hardware Error]: APEI generic hardware error status [ 281.557046] {2}[Hardware Error]: severity: 2, corrected [ 281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected [ 281.557050] {2}[Hardware Error]: flags: 0x01 [ 281.557052] {2}[Hardware Error]: primary [ 281.557053] {2}[Hardware Error]: section_type: memory error [ 281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400 [ 281.557056] {2}[Hardware Error]: node: 3 [ 281.557057] {2}[Hardware Error]: card: 0 [ 281.557058] {2}[Hardware Error]: module: 1 [ 281.557059] {2}[Hardware Error]: device: 0 [ 281.557061] {2}[Hardware Error]: error_type: 18, unknown [ 281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9 [ 281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory) Tested on a 4 CPUs E5-4650 Sandy Bridge machine. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-25 19:42:16 -03:00
Mauro Carvalho Chehab	d2a6856614	ghes_edac: Improve driver's printk messages Provide a better infrastructure for printk's inside the driver: - use edac_dbg() for debug messages; - standardize the usage of pr_info(); - provide warning about the risk of relying on this driver. While here, changes the size of a fake memory to 1 page. This is as good or as bad as 1000 pages, but it is easier for userspace to detect, as I don't expect that any machine implementing GHES would provide just 1 page available ;) Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com> Conflicts: drivers/edac/ghes_edac.c	2013-02-25 19:42:15 -03:00
Mauro Carvalho Chehab	5ee726db52	ghes_edac: Don't credit the same memory dimm twice On my tests on a 4xE5-4650 CPU's system, the GHES EDAC driver is called twice. As the SMBIOS DMI enumeration call will seek for the entire DIMM sockets in the system, on this machine, equipped with 128 GB of RAM, the memory is displayed twice: +-----------------------+ \| mc0 \| mc1 \| ----------+-----------------------+ memory45: \| 8192 MB \| 8192 MB \| memory44: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory43: \| 0 MB \| 0 MB \| memory42: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory41: \| 0 MB \| 0 MB \| memory40: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory39: \| 8192 MB \| 8192 MB \| memory38: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory37: \| 0 MB \| 0 MB \| memory36: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory35: \| 0 MB \| 0 MB \| memory34: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory33: \| 8192 MB \| 8192 MB \| memory32: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory31: \| 0 MB \| 0 MB \| memory30: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory29: \| 0 MB \| 0 MB \| memory28: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory27: \| 8192 MB \| 8192 MB \| memory26: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory25: \| 0 MB \| 0 MB \| memory24: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory23: \| 0 MB \| 0 MB \| memory22: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory21: \| 8192 MB \| 8192 MB \| memory20: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory19: \| 0 MB \| 0 MB \| memory18: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory17: \| 0 MB \| 0 MB \| memory16: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory15: \| 8192 MB \| 8192 MB \| memory14: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory13: \| 0 MB \| 0 MB \| memory12: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory11: \| 0 MB \| 0 MB \| memory10: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory9: \| 8192 MB \| 8192 MB \| memory8: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory7: \| 0 MB \| 0 MB \| memory6: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ memory5: \| 0 MB \| 0 MB \| memory4: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory3: \| 8192 MB \| 8192 MB \| memory2: \| 0 MB \| 0 MB \| ----------+-----------------------+ memory1: \| 0 MB \| 0 MB \| memory0: \| 8192 MB \| 8192 MB \| ----------+-----------------------+ Total sum of 256 GB. As there's no reliable way to credit DIMMS to the right memory controller, just put everything on memory controller 0 (with should always exist). Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-25 19:42:14 -03:00
Mauro Carvalho Chehab	32fa1f53c2	ghes_edac: do a better job of filling EDAC DIMM info Instead of just faking a random value for the DIMM data, get the information that it is available via DMI table. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-25 19:42:13 -03:00
Mauro Carvalho Chehab	f04c62a703	ghes_edac: add support for reporting errors via EDAC Now that the EDAC core is capable of just forward the errors via the userspace API, add a report mechanism for the GHES errors. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-25 19:42:13 -03:00
Mauro Carvalho Chehab	77c5f5d2f2	ghes_edac: Register at EDAC core the BIOS report Register GHES at EDAC MC core, in order to avoid other drivers to also handle errors and mangle with error data. The edac core will warrant that just one driver will be used, so the first one to register (BIOS first) will be the one that will be reporting the hardware errors. For now, the EDAC driver does nothing but to register at the EDAC core, preventing the hardware-driven mechanism to interfere with GHES. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2013-02-25 19:42:12 -03:00

33 Commits