Vladimir reported the following issue:
Commit c65c1877bd ("slub: use lockdep_assert_held") requires
remove_partial() to be called with n->list_lock held, but free_partial()
called from kmem_cache_close() on cache destruction does not follow this
rule, leading to a warning:
WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0()
Modules linked in:
CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1
Hardware name:
0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600
0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000
ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0
Call Trace:
__kmem_cache_shutdown+0x1b2/0x1f0
kmem_cache_destroy+0x43/0xf0
xfs_destroy_zones+0x103/0x110 [xfs]
exit_xfs_fs+0x38/0x4e4 [xfs]
SyS_delete_module+0x19a/0x1f0
system_call_fastpath+0x16/0x1b
His solution was to add a spinlock in order to quiet lockdep. Although
there would be no contention to adding the lock, that lock also requires
disabling of interrupts which will have a larger impact on the system.
Instead of adding a spinlock to a location where it is not needed for
lockdep, make a __remove_partial() function that does not test if the
list_lock is held, as no one should have it due to it being freed.
Also added a __add_partial() function that does not do the lock
validation either, as it is not needed for the creation of the cache.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c65c1877bd ("slub: use lockdep_assert_held") incorrectly
required that add_full() and remove_full() hold n->list_lock. The lock
is only taken when kmem_cache_debug(s), since that's the only time it
actually does anything.
Require that the lock only be taken under such a condition.
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull SLAB changes from Pekka Enberg:
"Random bug fixes that have accumulated in my inbox over the past few
months"
* 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
mm: Fix warning on make htmldocs caused by slab.c
mm: slub: work around unneeded lockdep warning
mm: sl[uo]b: fix misleading comments
slub: Fix possible format string bug.
slub: use lockdep_assert_held
slub: Fix calculation of cpu slabs
slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings
The slub code does some setup during early boot in
early_kmem_cache_node_alloc() with some local data. There is no
possible way that another CPU can see this data, so the slub code
doesn't unnecessarily lock it. However, some new lockdep asserts
check to make sure that add_partial() _always_ has the list_lock
held.
Just add the locking, even though it is technically unnecessary.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Commit abca7c4965 ("mm: fix slab->page _count corruption when using
slub") notes that we can not _set_ a page->counters directly, except
when using a real double-cmpxchg. Doing so can lose updates to
->_count.
That is an absolute rule:
You may not *set* page->counters except via a cmpxchg.
Commit abca7c4965 fixed this for the folks who have the slub
cmpxchg_double code turned off at compile time, but it left the bad case
alone. It can still be reached, and the same bug triggered in two
cases:
1. Turning on slub debugging at runtime, which is available on
the distro kernels that I looked at.
2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
cpus, evidently)
There are at least 3 ways we could fix this:
1. Take all of the exising calls to cmpxchg_double_slab() and
__cmpxchg_double_slab() and convert them to take an old, new
and target 'struct page'.
2. Do (1), but with the newly-introduced 'slub_data'.
3. Do some magic inside the two cmpxchg...slab() functions to
pull the counters out of new_counters and only set those
fields in page->{inuse,frozen,objects}.
I've done (2) as well, but it's a bunch more code. This patch is an
attempt at (3). This was the most straightforward and foolproof way
that I could think to do this.
This would also technically allow us to get rid of the ugly
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
in 'struct page', but leaving it alone has the added benefit that
'counters' stays 'unsigned' instead of 'unsigned long', so all the
copies that the slub code does stay a bit smaller.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 309381feae ("mm: dump page when hitting a VM_BUG_ON using
VM_BUG_ON_PAGE") added a bunch of VM_BUG_ON_PAGE() calls.
But, most of the ones in the slub code are for _temporary_ 'struct
page's which are declared on the stack and likely have lots of gunk in
them. Dumping their contents out will just confuse folks looking at
bad_page() output. Plus, if we try to page_to_pfn() on them or
soemthing, we'll probably oops anyway.
Turn them back in to VM_BUG_ON()s.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.
I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.
This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.
[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "name" is determined at runtime and is parsed as format string.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Instead of using comments in an attempt at getting the locking right,
use proper assertions that actively warn you if you got it wrong.
Also add extra braces in a few sites to comply with coding-style.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
/sys/kernel/slab/:t-0000048 # cat cpu_slabs
231 N0=16 N1=215
/sys/kernel/slab/:t-0000048 # cat slabs
145 N0=36 N1=109
See, the number of slabs is smaller than that of cpu slabs.
The bug was introduced by commit 49e2258586
("slub: per cpu cache for partial pages").
We should use page->pages instead of page->pobjects when calculating
the number of cpu partial slabs. This also fixes the mapping of slabs
and nodes.
As there's no variable storing the number of total/active objects in
cpu partial slabs, and we don't have user interfaces requiring those
statistics, I just add WARN_ON for those cases.
Cc: <stable@vger.kernel.org> # 3.2+
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Pull SLAB changes from Pekka Enberg:
"The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
slab internals similar to mm/slub.c. This reduces memory usage and
improves performance:
https://lkml.org/lkml/2013/10/16/155
Rest of the changes are bug fixes from various people"
* 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
mm, slub: fix the typo in mm/slub.c
mm, slub: fix the typo in include/linux/slub_def.h
slub: Handle NULL parameter in kmem_cache_flags
slab: replace non-existing 'struct freelist *' with 'void *'
slab: fix to calm down kmemleak warning
slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
slab: rename slab_bufctl to slab_freelist
slab: remove useless statement for checking pfmemalloc
slab: use struct page for slab management
slab: replace free and inuse in struct slab with newly introduced active
slab: remove SLAB_LIMIT
slab: remove kmem_bufctl_t
slab: change the management method of free objects of the slab
slab: use __GFP_COMP flag for allocating slab pages
slab: use well-defined macro, virt_to_slab()
slab: overloading the RCU head over the LRU for RCU free
slab: remove cachep in struct slab_rcu
slab: remove nodeid in struct slab
slab: remove colouroff in struct slab
slab: change return type of kmem_getpages() to struct page
...
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...
We can't see the relationship with memcg from the parameters,
so the name with memcg_idx would be more reasonable.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move all kmemleak calls into hook functions, and make it so
that all hooks (both inside and outside of #ifdef CONFIG_SLUB_DEBUG)
call the appropriate kmemleak routines. This allows for kmemleak
to be configured independently of slub debug features.
It also fixes a bug where kmemleak was only partially enabled in some
configurations.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Roman Bobniev <Roman.Bobniev@sonymobile.com>
Signed-off-by: Tim Bird <tim.bird@sonymobile.com>
Signed-off-by: Pekka Enberg <penberg@iki.fi>
Pull SLAB update from Pekka Enberg:
"Nothing terribly exciting here apart from Christoph's kmalloc
unification patches that brings sl[aou]b implementations closer to
each other"
* 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
slab: Use correct GFP_DMA constant
slub: remove verify_mem_not_deleted()
mm/sl[aou]b: Move kmallocXXX functions to common code
mm, slab_common: add 'unlikely' to size check of kmalloc_slab()
mm/slub.c: beautify code for removing redundancy 'break' statement.
slub: Remove unnecessary page NULL check
slub: don't use cpu partial pages on UP
mm/slub: beautify code for 80 column limitation and tab alignment
mm/slub: remove 'per_cpu' which is useless variable
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete. Thus, kstrtoul() should be used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmalloc* functions of all slab allcoators are similar now so
lets move them into slab.h. This requires some function naming changes
in slob.
As a results of this patch there is a common set of functions for
all allocators. Also means that kmalloc_large() is now available
in general to perform large order allocations that go directly
via the page allocator. kmalloc_large() can be substituted if
kmalloc() throws warnings because of too large allocations.
kmalloc_large() has exactly the same semantics as kmalloc but
can only used for allocations > PAGE_SIZE.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
In commit 4d7868e6(slub: Do not dereference NULL pointer in node_match)
had added check for page NULL in node_match. Thus, it is not needed
to check it before node_match, remove it.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
This reverts commit 318df36e57.
This commit caused Steven Rostedt's hackbench runs to run out of memory
due to a leak. As noted by Joonsoo Kim, it is buggy in the following
scenario:
"I guess, you may set 0 to all kmem caches's cpu_partial via sysfs,
doesn't it?
In this case, memory leak is possible in following case. Code flow of
possible leak is follwing case.
* in __slab_free()
1. (!new.inuse || !prior) && !was_frozen
2. !kmem_cache_debug && !prior
3. new.frozen = 1
4. after cmpxchg_double_slab, run the (!n) case with new.frozen=1
5. with this patch, put_cpu_partial() doesn't do anything,
because this cache's cpu_partial is 0
6. return
In step 5, leak occur"
And Steven does indeed have cpu_partial set to 0 due to RT testing.
Joonsoo is cooking up a patch, but everybody agrees that reverting this
for now is the right thing to do.
Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Be sure of 80 column limitation for both code and comments.
Correct tab alignment for 'if-else' statement.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Remove 'per_cpu', since it is useless now after the patch: "205ab99
slub: Update statistics handling for variable order slabs". And the
partial list is handled in the same way as the per cpu slab.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
This removes all the uses of the __cpuinit macros from C files in
the core kernel directories (kernel, init, lib, mm, and include)
that don't really have a specific maintainer.
[1] https://lkml.org/lkml/2013/5/20/589
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Pull slab update from Pekka Enberg:
"Highlights:
- Fix for boot-time problems on some architectures due to
init_lock_keys() not respecting kmalloc_caches boundaries
(Christoph Lameter)
- CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)
- Fix for excessive slab freelist draining (Wanpeng Li)
- SLUB and SLOB cleanups and fixes (various people)"
I ended up editing the branch, and this avoids two commits at the end
that were immediately reverted, and I instead just applied the oneliner
fix in between myself.
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
slub: Check for page NULL before doing the node_match check
mm/slab: Give s_next and s_stop slab-specific names
slob: Check for NULL pointer before calling ctor()
slub: Make cpu partial slab support configurable
slab: add kmalloc() to kernel API documentation
slab: fix init_lock_keys
slob: use DIV_ROUND_UP where possible
slub: do not put a slab to cpu partial list when cpu_partial is 0
mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
mm/slub: Drop unnecessary nr_partials
mm/slab: Fix /proc/slabinfo unwriteable for slab
mm/slab: Sharing s_next and s_stop between slab and slub
mm/slab: Fix drain freelist excessively
slob: Rework #ifdeffery in slab.h
mm, slab: moved kmem_cache_alloc_node comment to correct place
In the -rt kernel (mrg), we hit the following dump:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
PGD a2d39067 PUD b1641067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
CPU 3
Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
RIP: 0010:[<ffffffff811573f1>] [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
RSP: 0018:ffff8800a9b17d70 EFLAGS: 00010213
RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
FS: 00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
Stack:
ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
0000000001200011 0000000001200011 0000000000000000 0000000000000000
00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
Call Trace:
[<ffffffff81202e08>] ? current_has_perm+0x68/0x80
[<ffffffff81041cbd>] copy_process+0xdd/0x15b0
[<ffffffff810a2125>] ? rt_up_read+0x25/0x30
[<ffffffff8104369a>] do_fork+0x5a/0x360
[<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
[<ffffffff8100b068>] sys_clone+0x28/0x30
[<ffffffff81527423>] stub_clone+0x13/0x20
[<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
RIP [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
RSP <ffff8800a9b17d70>
CR2: 0000000000000000
---[ end trace 0000000000000002 ]---
Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
disable migration. But the SLUB code is relatively lockless, and the
spin_locks there are raw_spin_locks (not converted to mutexes), thus I
believe this bug can happen in mainline without -rt features. The -rt
patch is just good at triggering mainline bugs ;-)
Anyway, looking at where this crashed, it seems that the page variable
can be NULL when passed to the node_match() function (which does not
check if it is NULL). When this happens we get the above panic.
As page is only used in slab_alloc() to check if the node matches, if
it's NULL I'm assuming that we can say it doesn't and call the
__slab_alloc() code. Is this a correct assumption?
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CPU partial support can introduce level of indeterminism that is not
wanted in certain context (like a realtime kernel). Make it
configurable.
This patch is based on Christoph Lameter's "slub: Make cpu partial slab
support configurable V2".
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
In free path, we don't check number of cpu_partial, so one slab can
be linked in cpu partial list even if cpu_partial is 0. To prevent this,
we should check number of cpu_partial in put_cpu_partial().
Acked-by: Christoph Lameeter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Use existing interface node_nr_slabs and node_nr_objs to get
nr_slabs and nr_objs.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
This patch remove unused nr_partials variable.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Pull slab changes from Pekka Enberg:
"The bulk of the changes are more slab unification from Christoph.
There's also few fixes from Aaron, Glauber, and Joonsoo thrown into
the mix."
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
mm, slab_common: Fix bootstrap creation of kmalloc caches
slab: Return NULL for oversized allocations
mm: slab: Verify the nodeid passed to ____cache_alloc_node
slub: tid must be retrieved from the percpu area of the current processor
slub: Do not dereference NULL pointer in node_match
slub: add 'likely' macro to inc_slabs_node()
slub: correct to calculate num of acquired objects in get_partial_node()
slub: correctly bootstrap boot caches
mm/sl[au]b: correct allocation type check in kmalloc_slab()
slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
slab: Handle ARCH_DMA_MINALIGN correctly
slab: Common definition for kmem_cache_node
slab: Rename list3/l3 to node
slab: Common Kmalloc cache determination
stat: Use size_t for sizes instead of unsigned
slab: Common function to create the kmalloc array
slab: Common definition for the array of kmalloc caches
slab: Common constants for kmalloc boundaries
slab: Rename nodelists to node
slab: Common name for the per node structures
...
Squishes a statement-with-no-effect warning, removes some ifdefs and
shrinks .text by 2 bytes.
Note that this code fails to check for blocking_notifier_chain_register()
failures.
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Steven Rostedt has pointer out: rescheduling could occur on a
different processor after the determination of the per cpu pointer and
before the tid is retrieved. This could result in allocation from the
wrong node in slab_alloc().
The effect is much more severe in slab_free() where we could free to the
freelist of the wrong page.
The window for something like that occurring is pretty small but it is
possible.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The variables accessed in slab_alloc are volatile and therefore
the page pointer passed to node_match can be NULL. The processing
of data in slab_alloc is tentative until either the cmpxhchg
succeeds or the __slab_alloc slowpath is invoked. Both are
able to perform the same allocation from the freelist.
Check for the NULL pointer in node_match.
A false positive will lead to a retry of the loop in __slab_alloc.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
After boot phase, 'n' always exist.
So add 'likely' macro for helping compiler.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
After we create a boot cache, we may allocate from it until it is bootstraped.
This will move the page from the partial list to the cpu slab list. If this
happens, the loop:
list_for_each_entry(p, &n->partial, lru)
that we use to scan for all partial pages will yield nothing, and the pages
will keep pointing to the boot cpu cache, which is of course, invalid. To do
that, we should flush the cache to make sure that the cpu slab is back to the
partial list.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Steffen Michalke <StMichalke@web.de>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
lockdep, but it's a mechanical change.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf
4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf
3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6
eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee
ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn
4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC
Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1
AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO
32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv
lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO
rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK
+GszxsFkCjlW0mK0egTb
=tiSY
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"The sweeping change is to make add_taint() explicitly indicate whether
to disable lockdep, but it's a mechanical change."
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Add option to not sign modules during modules_install
MODSIGN: Add -s <signature> option to sign-file
MODSIGN: Specify the hash algorithm on sign-file command line
MODSIGN: Simplify Makefile with a Kconfig helper
module: clean up load_module a little more.
modpost: Ignore ARC specific non-alloc sections
module: constify within_module_*
taint: add explicit flag to show whether lock dep is still OK.
module: printk message when module signature fail taints kernel.
The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern. As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset(). There are others like init_page_count() but as
it is used throughout the arch code a rename would likely cause more
conflicts than it is worth.
[akpm@linux-foundation.org: fix zcache]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extract the optimized lookup functions from slub and put them into
slab_common.c. Then make slab use these functions as well.
Joonsoo notes that this fixes some issues with constant folding which
also reduces the code size for slub.
https://lkml.org/lkml/2012/10/20/82
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The kmalloc array is created in similar ways in both SLAB
and SLUB. Create a common function and have both allocators
call that function.
V1->V2:
Whitespace cleanup
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Have a common definition fo the kmalloc cache arrays in
SLAB and SLUB
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Standardize the constants that describe the smallest and largest
object kept in the kmalloc arrays for SLAB and SLUB.
Differentiate between the maximum size for which a slab cache is used
(KMALLOC_MAX_CACHE_SIZE) and the maximum allocatable size
(KMALLOC_MAX_SIZE, KMALLOC_MAX_ORDER).
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Fix up all callers as they were before, with make one change: an
unsigned module taints the kernel, but doesn't turn off lockdep.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Sasha Levin recently reported a lockdep problem resulting from the new
attribute propagation introduced by kmemcg series. In short, slab_mutex
will be called from within the sysfs attribute store function. This will
create a dependency, that will later be held backwards when a cache is
destroyed - since destruction occurs with the slab_mutex held, and then
calls in to the sysfs directory removal function.
In this patch, I propose to adopt a strategy close to what
__kmem_cache_create does before calling sysfs_slab_add, and release the
lock before the call to sysfs_slab_remove. This is pretty much the last
operation in the kmem_cache_shutdown() path, so we could do better by
splitting this and moving this call alone to later on. This will fit
nicely when sysfs handling is consistent between all caches, but will look
weird now.
Lockdep info:
======================================================
[ INFO: possible circular locking dependency detected ]
3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
-------------------------------------------------------
trinity-child13/6961 is trying to acquire lock:
(s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60
but task is already holding lock:
(slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (slab_mutex){+.+.+.}:
lock_acquire+0x1aa/0x240
__mutex_lock_common+0x59/0x5a0
mutex_lock_nested+0x3f/0x50
slab_attr_store+0xde/0x110
sysfs_write_file+0xfa/0x150
vfs_write+0xb0/0x180
sys_pwrite64+0x60/0xb0
tracesys+0xe1/0xe6
-> #0 (s_active#43){++++.+}:
__lock_acquire+0x14df/0x1ca0
lock_acquire+0x1aa/0x240
sysfs_deactivate+0x122/0x1a0
sysfs_addrm_finish+0x31/0x60
sysfs_remove_dir+0x89/0xd0
kobject_del+0x16/0x40
__kmem_cache_shutdown+0x40/0x60
kmem_cache_destroy+0x40/0xe0
mon_text_release+0x78/0xe0
__fput+0x122/0x2d0
____fput+0x9/0x10
task_work_run+0xbe/0x100
do_exit+0x432/0xbd0
do_group_exit+0x84/0xd0
get_signal_to_deliver+0x81d/0x930
do_signal+0x3a/0x950
do_notify_resume+0x3e/0x90
int_signal+0x12/0x17
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(slab_mutex);
lock(s_active#43);
lock(slab_mutex);
lock(s_active#43);
*** DEADLOCK ***
2 locks held by trinity-child13/6961:
#0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0
#1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0
stack backtrace:
Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
Call Trace:
print_circular_bug+0x1fb/0x20c
__lock_acquire+0x14df/0x1ca0
lock_acquire+0x1aa/0x240
sysfs_deactivate+0x122/0x1a0
sysfs_addrm_finish+0x31/0x60
sysfs_remove_dir+0x89/0xd0
kobject_del+0x16/0x40
__kmem_cache_shutdown+0x40/0x60
kmem_cache_destroy+0x40/0xe0
mon_text_release+0x78/0xe0
__fput+0x122/0x2d0
____fput+0x9/0x10
task_work_run+0xbe/0x100
do_exit+0x432/0xbd0
do_group_exit+0x84/0xd0
get_signal_to_deliver+0x81d/0x930
do_signal+0x3a/0x950
do_notify_resume+0x3e/0x90
int_signal+0x12/0x17
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch clarifies two aspects of cache attribute propagation.
First, the expected context for the for_each_memcg_cache macro in
memcontrol.h. The usages already in the codebase are safe. In mm/slub.c,
it is trivially safe because the lock is acquired right before the loop.
In mm/slab.c, it is less so: the lock is acquired by an outer function a
few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
indeed safe.
A comment is also added to detail why we are returning the value of the
parent cache and ignoring the children's when we propagate the attributes.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>