linux

Commit Graph

Author	SHA1	Message	Date
Atsushi Kumagai	13ba3fcbbe	kexec, vmalloc: export additional vmalloc layer information Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get the start address of vmalloc region (vmalloc_start). The address which contains vmalloc_start value is represented as below: vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start) However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list) aren't exported as VMCOREINFO. So this patch exports them externally with small cleanup. [akpm@linux-foundation.org: vmalloc.h should include list.h for list_head] Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:34 -07:00
Joonsoo Kim	f1c4069e1d	mm, vmalloc: export vmap_area_list, instead of vmlist Although our intention is to unexport internal structure entirely, but there is one exception for kexec. kexec dumps address of vmlist and makedumpfile uses this information. We are about to remove vmlist, then another way to retrieve information of vmalloc layer is needed for makedumpfile. For this purpose, we export vmap_area_list, instead of vmlist. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:34 -07:00
Josh Triplett	146732ce10	fs: don't compile in drop_caches.c when CONFIG_SYSCTL=n drop_caches.c provides code only invokable via sysctl, so don't compile it in when CONFIG_SYSCTL=n. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:33 -07:00
Michal Hocko	6d2488f64a	cgroup: remove css_get_next Now that we have generic and well ordered cgroup tree walkers there is no need to keep css_get_next in the place. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:33 -07:00
Jiang Liu	e07cee23e6	mm,kexec: use common help functions to free reserved pages Use common help functions to free reserved pages. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Eric Biederman <ebiederm@xmission.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:31 -07:00
Chen Gang	12b2f117f3	kernel/audit_tree.c: tree will leak memory when failure occurs in audit_trim_trees() audit_trim_trees() calls get_tree(). If a failure occurs we must call put_tree(). [akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement] Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:26 -07:00
Chen Gang	373e0f3408	kernel/auditfilter.c: tree and watch will memory leak when failure occurs In audit_data_to_entry() when a failure occurs we must check and free the tree and watch to avoid a memory leak. test: plan: test command: "auditctl -a exit,always -w /etc -F auid=-1" (on fedora17, need modify auditctl to let "-w /etc" has effect) running: under fedora17 x86_64, 2 CPUs 3.20GHz, 2.5GB RAM. let 15 auditctl processes continue running at the same time. monitor command:
watch -d -n 1 "cat /proc/meminfo | awk '{print \$2}' \
  | head -n 4 | xargs \
  | awk '{print \"used \",\$1 - \$2 - \$3 - \$4}'"
result: for original version: will use up all memory, within 3 hours. kill all auditctl, the memory still does not free. for new version (apply this patch): after 14 hours later, not find issues. Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:26 -07:00
Gao feng	dde5b7d6e7	audit: remove unnecessary #if CONFIG_AUDIT The files which include kernel/audit.h are compiled only when CONFIG_AUDIT is set. Just like audit_pid, there is no need to surround audit_ever_enabled with CONFIG_AUDIT. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:26 -07:00
Gao feng	374c586d95	audit: remove duplicate export of audit_enabled audit_enabled has already been exported in include/linux/audit.h, and kernel/audit.h includes include/linux/audit.h, so there is no need to export audit_enabled again in kernel/audit.h. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:26 -07:00
Gao feng	13f51e1c3f	audit: don't check if kauditd is valid every time We only need to check if kauditd is valid after we start it, if kauditd is invalid, we will set kauditd_task to NULL. So next time, we will start kauditd again. It means if kauditd_task is not NULL,it must be valid. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:26 -07:00
Rakib Mullick	3f68613f39	kernel/auditsc.c: use kzalloc instead of kmalloc+memset In audit_alloc_context() use kzalloc instead of kmalloc+memset. Also rename audit_zero_context() to audit_set_context(), to represent its inner workings properly. [akpm@linux-foundation.org: remove audit_set_context() altogether - fold it into its caller] Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:26 -07:00
Oleg Nesterov	b5c5442bb6	kthread: kill task_get_live_kthread() task_get_live_kthread() looks confusing and unneeded. It does get_task_struct() but only kthread_stop() needs this, it can be called even if the caller doesn't have a reference when we know that this kthread can't exit until we do kthread_stop(). kthread_park() and kthread_unpark() do not need get_task_struct(), the callers already have the reference. And it can not help if we can race with the exiting kthread anyway, kthread_park() can hang forever in this case. Change kthread_park() and kthread_unpark() to use to_live_kthread(), change kthread_stop() to do get_task_struct() by hand and remove task_get_live_kthread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Namhyung Kim <namhyung@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:25 -07:00
Oleg Nesterov	4ecdafc808	kthread: introduce to_live_kthread() "k->vfork_done != NULL" with a barrier() after to_kthread(k) in task_get_live_kthread(k) looks unclear, and sub-optimal because we load ->vfork_done twice. All we need is to ensure that we do not return to_kthread(NULL). Add a new trivial helper which loads/checks ->vfork_done once, this also looks more understandable. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Namhyung Kim <namhyung@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:25 -07:00
Linus Torvalds	916bb6d76d	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking changes from Ingo Molnar: "The most noticeable change are mutex speedups from Waiman Long, for higher loads. These scalability changes should be most noticeable on larger server systems. There are also cleanups, fixes and debuggability improvements." * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Consolidate bug messages into a single print_lockdep_off() function lockdep: Print out additional debugging advice when we hit lockdep BUGs mutex: Back out architecture specific check for negative mutex count mutex: Queue mutex spinners with MCS lock to reduce cacheline contention mutex: Make more scalable by doing less atomic operations mutex: Move mutex spinning code from sched/core.c back to mutex.c locking/rtmutex/tester: Set correct permissions on sysfs files lockdep: Remove unnecessary 'hlock_next' variable	2013-04-29 08:21:37 -07:00
Linus Torvalds	e09d13c4c8	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "This fix adds missing RCU read protection" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: events: Protect access via task_subsys_state_check()	2013-04-27 10:08:09 -07:00
Dave Jones	2c52283662	lockdep: Consolidate bug messages into a single print_lockdep_off() function Also add some missing printk levels. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20130425174002.GA26769@redhat.com [ Tweaked the messages a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-26 08:37:22 +02:00
Dave Jones	199e371f59	lockdep: Print out additional debugging advice when we hit lockdep BUGs We occasionally get reports of these BUGs being hit, and the stack trace doesn't necessarily always tell us what we need to know about why we are hitting those limits. If users start attaching /proc/lock_stats to reports we may have more of a clue what's going on. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20130423163403.GA12839@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-26 08:36:33 +02:00
Rusty Russell	f83b293366	kernel/hz.bc: ignore. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-22 07:09:06 -07:00
Linus Torvalds	3125929454	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix offcore_rsp valid mask for SNB/IVB perf: Treat attr.config as u64 in perf_swevent_init()	2013-04-21 10:25:42 -07:00
Paul E. McKenney	c79aa0d965	events: Protect access via task_subsys_state_check() The following RCU splat indicates lack of RCU protection: [ 953.267649] =============================== [ 953.267652] [ INFO: suspicious RCU usage. ] [ 953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted [ 953.267661] ------------------------------- [ 953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage! [ 953.267669] [ 953.267669] other info that might help us debug this: [ 953.267669] [ 953.267675] [ 953.267675] rcu_scheduler_active = 1, debug_locks = 0 [ 953.267680] 1 lock held by glxgears/1289: [ 953.267683] #0: (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0 [ 953.267700] [ 953.267700] stack backtrace: [ 953.267704] Call Trace: [ 953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable) [ 953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180 [ 953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690 [ 953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0 [ 953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220 [ 953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0 ... This commit therefore adds the required RCU read-side critical section to perf_event_comm(). Reported-by: Adam Jackson <ajax@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: a.p.zijlstra@chello.nl Cc: paulus@samba.org Cc: acme@ghostprotocols.net Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Gustavo Luiz Duarte <gusld@br.ibm.com>	2013-04-21 11:21:39 +02:00
Linus Torvalds	830ac8524f	Merge branch 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull kdump fixes from Peter Anvin: "The kexec/kdump people have found several problems with the support for loading over 4 GiB that was introduced in this merge cycle. This is partly due to a number of design problems inherent in the way the various pieces of kdump fit together (it is pretty horrifically manual in many places.) After a lot of iterations this is the patchset that was agreed upon, but of course it is now very late in the cycle. However, because it changes both the syntax and semantics of the crashkernel option, it would be desirable to avoid a stable release with the broken interfaces." I'm not happy with the timing, since originally the plan was to release the final 3.9 tomorrow. But apparently I'm doing an -rc8 instead... * 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kexec: use Crash kernel for Crash kernel low x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low x86, kdump: Retore crashkernel= to allocate under 896M x86, kdump: Set crashkernel_low automatically	2013-04-20 18:40:36 -07:00
Waiman Long	cc189d2513	mutex: Back out architecture specific check for negative mutex count Linus suggested that probably all the supported architectures can allow a negative mutex count without incorrect behavior, so we can then back out the architecture specific change and allow the mutex count to go to any negative number. That should further reduce contention for non-x86 architecture. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-19 09:33:36 +02:00
Waiman Long	2bd2c92cf0	mutex: Queue mutex spinners with MCS lock to reduce cacheline contention The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned on) allow multiple tasks to spin on a single mutex concurrently. A potential problem with the current approach is that when the mutex becomes available, all the spinning tasks will try to acquire the mutex more or less simultaneously. As a result, there will be a lot of cacheline bouncing especially on systems with a large number of CPUs. This patch tries to reduce this kind of contention by putting the mutex spinners into a queue so that only the first one in the queue will try to acquire the mutex. This will reduce contention and allow all the tasks to move forward faster. The queuing of mutex spinners is done using an MCS lock based implementation which will further reduce contention on the mutex cacheline than a similar ticket spinlock based implementation. This patch will add a new field into the mutex data structure for holding the MCS lock. This expands the mutex size by 8 bytes for 64-bit system and 4 bytes for 32-bit system. This overhead will be avoid if the MUTEX_SPIN_ON_OWNER option is turned off. The following table shows the jobs per minute (JPM) scalability data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl command is used to restrict the running of the fserver workloads to 1/2/4/8 nodes with hyperthreading off. 
+-----------------+-----------+-----------+-------------+----------+
| Configuration   | Mean JPM  | Mean JPM  | Mean JPM    | % Change |
|                 | w/o patch | patch 1   | patches 1&2 | 1->1&2   |
+-----------------+------------------------------------------------+
|                 | User Range 1100 - 2000                         |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972    | 227237    | 305043      | +34.2%   |
| 4 nodes, HT off | 393503    | 381558    | 394650      | +3.4%    |
| 2 nodes, HT off | 334957    | 325240    | 338853      | +4.2%    |
| 1 node , HT off | 198141    | 197972    | 198075      | +0.1%    |
+-----------------+------------------------------------------------+
|                 | User Range 200 - 1000                          |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325    | 312870    | 332185      | +6.2%    |
| 4 nodes, HT off | 390698    | 378279    | 393419      | +4.0%    |
| 2 nodes, HT off | 336986    | 326543    | 340260      | +4.2%    |
| 1 node , HT off | 197588    | 197622    | 197582      | 0.0%     |
+-----------------+-----------+-----------+-------------+----------+
At low user range 10-100, the JPM differences were within +/-1%. So they are not that interesting. The fserver workload uses mutex spinning extensively. With just the mutex change in the first patch, there is no noticeable change in performance. Rather, there is a slight drop in performance. This mutex spinning patch more than recovers the lost performance and show a significant increase of +30% at high user load with the full 8 nodes. Similar improvements were also seen in a 3.8 kernel. The table below shows the %time spent by different kernel functions as reported by perf when running the fserver workload at 1500 users with all 8 nodes. 
+-----------------------+-----------+---------+-------------+
| Function              | % time    | % time  | % time      |
|                       | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed    | 34.96%    | 34.91%  | 29.14%      |
| __write_lock_failed   | 10.14%    | 10.68%  | 7.51%       |
| mutex_spin_on_owner   | 3.62%     | 3.42%   | 2.33%       |
| mspin_lock            | N/A       | N/A     | 9.90%       |
| __mutex_lock_slowpath | 1.46%     | 0.81%   | 0.14%       |
| _raw_spin_lock        | 2.25%     | 2.50%   | 1.10%       |
+-----------------------+-----------+---------+-------------+
The fserver workload for an 8-node system is dominated by the contention in the read/write lock. Mutex contention also plays a role. With the first patch only, mutex contention is down (as shown by the __mutex_lock_slowpath figure) which help a little bit. We saw only a few percents improvement with that. By applying patch 2 as well, the single mutex_spin_on_owner figure is now split out into an additional mspin_lock figure. The time increases from 3.42% to 11.23%. It shows a great reduction in contention among the spinners leading to a 30% improvement. The time ratio 9.9/2.33=4.3 indicates that there are on average 4+ spinners waiting in the spin_lock loop for each spinner in the mutex_spin_on_owner loop. Contention in other locking functions also go down by quite a lot. The table below shows the performance change of both patches 1 & 2 over patch 1 alone in other AIM7 workloads (at 8 nodes, hyperthreading off). 
+--------------+---------------+----------------+-----------------+
| Workload     | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     | 0.0%          | -0.8%          | +0.6%           |
| five_sec     | -0.3%         | +0.8%          | +0.8%           |
| high_systime | +0.4%         | +2.4%          | +2.1%           |
| new_fserver  | +0.1%         | +14.1%         | +34.2%          |
| shared       | -0.5%         | -0.3%          | -0.4%           |
| short        | -1.7%         | -9.8%          | -8.3%           |
+--------------+---------------+----------------+-----------------+
The short workload is the only one that shows a decline in performance probably due to the spinner locking and queuing overhead. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-19 09:33:36 +02:00
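The queuing idea in commit 2bd2c92cf0 can be illustrated outside the kernel. What follows is a minimal user-space sketch of a generic MCS queue lock built on C11 atomics; it is not the kernel's mspin_lock code, and the names `mcs_node`, `mcs_lock`, `mcs_acquire` and `mcs_release` are assumptions chosen for this example. Each waiter spins on a flag in its own node, so a release touches only the successor's cacheline rather than waking every spinner at once.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; a sketch of a generic MCS lock, not kernel code. */
struct mcs_node {
    _Atomic(struct mcs_node *) next; /* successor in the wait queue */
    atomic_bool granted;             /* set by the predecessor on hand-off */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail; /* last waiter, or NULL when free */
};

static void mcs_init(struct mcs_lock *lock)
{
    atomic_init(&lock->tail, NULL);
}

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&node->granted, false, memory_order_relaxed);

    /* Append ourselves to the queue with a single atomic exchange. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
    if (prev == NULL)
        return; /* queue was empty: we own the lock */

    /* Link behind the predecessor, then spin on our own flag only. */
    atomic_store_explicit(&prev->next, node, memory_order_release);
    while (!atomic_load_explicit(&node->granted, memory_order_acquire))
        ; /* local spinning: no traffic on the shared tail pointer */
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
{
    struct mcs_node *next =
        atomic_load_explicit(&node->next, memory_order_acquire);

    if (next == NULL) {
        /* No visible successor: try to mark the lock free. */
        struct mcs_node *expected = node;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A waiter swapped the tail but has not linked itself yet. */
        while ((next = atomic_load_explicit(&node->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to exactly one successor. */
    atomic_store_explicit(&next->granted, true, memory_order_release);
}
```

A caller would pass a per-thread `struct mcs_node` (for example, one on its stack) to `mcs_acquire()` and the same node to `mcs_release()`. In the commit, only the spinner at the head of such a queue goes on to contend for the mutex itself.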
Waiman Long	0dc8c730c9	mutex: Make more scalable by doing less atomic operations In the __mutex_lock_common() function, an initial entry into the lock slow path will cause two atomic_xchg instructions to be issued. Together with the atomic decrement in the fast path, a total of three atomic read-modify-write instructions will be issued in rapid succession. This can cause a lot of cache bouncing when many tasks are trying to acquire the mutex at the same time. This patch will reduce the number of atomic_xchg instructions used by checking the counter value first before issuing the instruction. The atomic_read() function is just a simple memory read. The atomic_xchg() function, on the other hand, can be up to 2 order of magnitude or even more in cost when compared with atomic_read(). By using atomic_read() to check the value first before calling atomic_xchg(), we can avoid a lot of unnecessary cache coherency traffic. The only downside with this change is that a task on the slow path will have a tiny bit less chance of getting the mutex when competing with another task in the fast path. The same is true for the atomic_cmpxchg() function in the mutex-spin-on-owner loop. So an atomic_read() is also performed before calling atomic_cmpxchg(). The mutex locking and unlocking code for the x86 architecture can allow any negative number to be used in the mutex count to indicate that some tasks are waiting for the mutex. I am not so sure if that is the case for the other architectures. So the default is to avoid atomic_xchg() if the count has already been set to -1. For x86, the check is modified to include all negative numbers to cover a larger case. The following table shows the jobs per minutes (JPM) scalability data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl command is used to restrict the running of the high_systime workloads to 1/2/4/8 nodes with hyperthreading on and off. 
+-----------------+-----------+------------+----------+
| Configuration   | Mean JPM  | Mean JPM   | % Change |
|                 | w/o patch | with patch |          |
+-----------------+-----------------------------------+
|                 | User Range 1100 - 2000            |
+-----------------+-----------------------------------+
| 8 nodes, HT on  | 36980     | 148590     | +301.8%  |
| 8 nodes, HT off | 42799     | 145011     | +238.8%  |
| 4 nodes, HT on  | 61318     | 118445     | +51.1%   |
| 4 nodes, HT off | 158481    | 158592     | +0.1%    |
| 2 nodes, HT on  | 180602    | 173967     | -3.7%    |
| 2 nodes, HT off | 198409    | 198073     | -0.2%    |
| 1 node , HT on  | 149042    | 147671     | -0.9%    |
| 1 node , HT off | 126036    | 126533     | +0.4%    |
+-----------------+-----------------------------------+
|                 | User Range 200 - 1000             |
+-----------------+-----------------------------------+
| 8 nodes, HT on  | 41525     | 122349     | +194.6%  |
| 8 nodes, HT off | 49866     | 124032     | +148.7%  |
| 4 nodes, HT on  | 66409     | 106984     | +61.1%   |
| 4 nodes, HT off | 119880    | 130508     | +8.9%    |
| 2 nodes, HT on  | 138003    | 133948     | -2.9%    |
| 2 nodes, HT off | 132792    | 131997     | -0.6%    |
| 1 node , HT on  | 116593    | 115859     | -0.6%    |
| 1 node , HT off | 104499    | 104597     | +0.1%    |
+-----------------+-----------+------------+----------+
At low user range 10-100, the JPM differences were within +/-1%. So they are not that interesting. AIM7 benchmark run has a pretty large run-to-run variance due to random nature of the subtests executed. So a difference of less than +-5% may not be really significant. This patch improves high_systime workload performance at 4 nodes and up by maintaining transaction rates without significant drop-off at high node count. The patch has practically no impact on 1 and 2 nodes system. 
The table below shows the percentage time (as reported by perf record -a -s -g) spent on the __mutex_lock_slowpath() function by the high_systime workload at 1500 users for 2/4/8-node configurations with hyperthreading off.
+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes       | 65.34%          | 0.69%            | -99%    |
| 4 nodes       | 8.70%           | 1.02%            | -88%    |
| 2 nodes       | 0.41%           | 0.32%            | -22%    |
+---------------+-----------------+------------------+---------+
It is obvious that the dramatic performance improvement at 8 nodes was due to the drastic cut in the time spent within the __mutex_lock_slowpath() function. The table below show the improvements in other AIM7 workloads (at 8 nodes, hyperthreading off).
+--------------+---------------+----------------+-----------------+
| Workload     | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     | +0.6%         | +104.2%        | +185.9%         |
| five_sec     | +1.9%         | +0.9%          | +0.9%           |
| fserver      | +1.4%         | -7.7%          | +5.1%           |
| new_fserver  | -0.5%         | +3.2%          | +3.1%           |
| shared       | +13.1%        | +146.1%        | +181.5%         |
| short        | +7.4%         | +5.0%          | +4.2%           |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Norton: Scott J <scott.norton@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-19 09:33:35 +02:00
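The read-before-exchange check described in commit 0dc8c730c9 can be illustrated in user space. The block below is a minimal sketch with C11 atomics, not the kernel's __mutex_lock_common() code. The function name `slowpath_try_acquire` is hypothetical, and the counter convention (1 = unlocked, 0 = locked, -1 = locked with waiters) is an assumption based on the commit's default, non-x86 case.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Assumed counter convention (illustration only):
 *   1  = unlocked
 *   0  = locked, no waiters
 *  -1  = locked, waiters present
 *
 * Returns true if the caller obtained the lock.
 */
static bool slowpath_try_acquire(atomic_int *count)
{
    /*
     * Plain load first. If the count is already -1, an exchange could
     * not succeed and would only pull the cacheline in exclusive mode,
     * so skip the read-modify-write entirely.
     */
    if (atomic_load_explicit(count, memory_order_relaxed) == -1)
        return false;

    /*
     * Otherwise mark the lock as contended. Seeing an old value of 1
     * means the lock was free and now belongs to us.
     */
    return atomic_exchange_explicit(count, -1, memory_order_acquire) == 1;
}
```

The trade-off is the one the commit message names: a task that sees -1 and skips the exchange has a slightly smaller chance of grabbing the lock than a task racing on the fast path, in return for far less cache coherency traffic.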
Waiman Long	41fcb9f230	mutex: Move mutex spinning code from sched/core.c back to mutex.c As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature bit was really just an early hack to make with/without mutex-spinning testable. So it is no longer necessary. This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and move the mutex spinning code from kernel/sched/core.c back to kernel/mutex.c which is where they should belong. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-19 09:33:34 +02:00
Linus Torvalds	6835039d7e	Merge branch 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux Pull user-namespace fixes from Andy Lutomirski. * 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux: userns: Changing any namespace id mappings should require privileges userns: Check uid_map's opener's fsuid, not the current fsuid userns: Don't let unprivileged users trick privileged users into setting the id_map	2013-04-18 18:09:12 -07:00
Linus Torvalds	0a82a8d132	Revert "block: add missing block_bio_complete() tracepoint" This reverts commit `3a366e614d`. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-18 09:00:26 -07:00
Masami Hiramatsu	5c51543b0a	kprobes: Fix a double lock bug of kprobe_mutex Fix a double locking bug caused when debug.kprobe-optimization=0. While the proc_kprobes_optimization_handler locks kprobe_mutex, wait_for_kprobe_optimizer locks it again and that causes a double lock. To fix the bug, this introduces different mutex for protecting sysctl parameter and locks it in proc_kprobes_optimization_handler. Of course, since we need to lock kprobe_mutex when touching kprobes resources, that is done in *optimize_all_kprobes(). This bug was introduced by commit `ad72b3bea7` ("kprobes: fix wait_for_kprobe_optimizer()") Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-18 08:58:38 -07:00
Emese Revfy	b9e146d8eb	kernel/signal.c: stop info leak via the tkill and the tgkill syscalls This fixes a kernel memory contents leak via the tkill and tgkill syscalls for compat processes. This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field when handling signals delivered from tkill. The place of the infoleak: int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) { ... put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr); ... } Signed-off-by: Emese Revfy <re.emese@gmail.com> Reviewed-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-17 16:10:45 -07:00
Yinghai Lu	157752d84f	kexec: use Crash kernel for Crash kernel low We can extend kexec-tools to support multiple "Crash kernel" in /proc/iomem instead. So we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem. Suggested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-17 12:35:34 -07:00
Yinghai Lu	adbc742bf7	x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low Per hpa, use crashkernel=X,high crashkernel=Y,low instead of crashkernel_high=X crashkernel_low=Y, as that form is extensible. -v2: according to Vivek, change delimiter to ; -v3: let high and low only handle the simple form so that it conforms to the description in kernel-parameters.txt; still keep crashkernel=X overriding any crashkernel=X,high crashkernel=Y,low -v4: update get_last_crashkernel returning and add stricter checking in parse_crashkernel_simple() found by HATAYAMA. -v5: Change delimiter back to , according to HPA. Also separate parse_suffix from parse_simple according to Vivek, so we can avoid @pos in that path. -v6: Tighten the checking about crashkernel=X,highblahblah,high found by HATAYAMA. Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-17 12:35:33 -07:00
Yinghai Lu	55a20ee780	x86, kdump: Retore crashkernel= to allocate under 896M Vivek found that old kexec-tools does not work with the new kernel anymore. So change crashkernel= back to the old behavior, and add crashkernel_high= to let the user decide if the buffer could be above 4G; new kexec-tools will also be needed. -v2: let crashkernel=X override crashkernel_high=; update description to say _high will be ignored by crashkernel=X -v3: update description in kernel-parameters.txt according to Vivek. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-17 12:35:33 -07:00
Linus Torvalds	bb33db7a07	Merge branches 'timers-urgent-for-linus', 'irq-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull {timer,irq,core} fixes from Thomas Gleixner: - timer: bug fix for a cpu hotplug race. - irq: single bugfix for a wrong return value, which prevents the calling function from invoking the software fallback. - core: bugfix which plugs two race conditions which can cause hotplug per cpu threads to end up on the wrong cpu. * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Don't reinitialize a cpu_base lock on CPU_UP * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: gic: fix irq_trigger return * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kthread: Prevent unpark race which puts threads on the wrong cpu	2013-04-15 07:03:01 -07:00
Tommi Rantala	8176cced70	perf: Treat attr.config as u64 in perf_swevent_init() Trinity discovered that we fail to check all 64 bits of attr.config passed by user space, resulting in out-of-bounds access of the perf_swevent_enabled array in sw_perf_event_destroy(). Introduced in commit `b0a873ebb` ("perf: Register PMU implementations"). Signed-off-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-15 11:42:12 +02:00
Andy Lutomirski	41c21e351e	userns: Changing any namespace id mappings should require privileges Changing uid/gid/projid mappings doesn't change your id within the namespace; it reconfigures the namespace. Unprivileged programs should not be able to write these files. (We're also checking the privileges on the wrong task.) Given the write-once nature of these files and the other security checks, this is likely impossible to usefully exploit. Signed-off-by: Andy Lutomirski <luto@amacapital.net>	2013-04-14 18:11:32 -07:00
Andy Lutomirski	e3211c120a	userns: Check uid_map's opener's fsuid, not the current fsuid Signed-off-by: Andy Lutomirski <luto@amacapital.net>	2013-04-14 18:11:31 -07:00
Eric W. Biederman	6708075f10	userns: Don't let unprivileged users trick privileged users into setting the id_map When we require privilege for setting /proc/<pid>/uid_map or /proc/<pid>/gid_map, no longer allow an unprivileged user to open the file and pass it to a privileged program to write to the file. Instead, when privilege is required, require both the opener and the writer to have the necessary capabilities. I have tested this code and verified that setting /proc/<pid>/uid_map fails when an unprivileged user opens the file and a privileged user attempts to set the mapping, that unprivileged users can still map their own id, and that privileged users can still set up an arbitrary mapping. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net>	2013-04-14 18:11:14 -07:00
Linus Torvalds	af788e35bf	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixlets" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime: Fix accounting on multi-threaded processes sched/debug: Fix sd->*_idx limit range avoiding overflow sched_clock: Prevent 64bit inatomicity on 32bit systems sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s	2013-04-14 11:12:17 -07:00
Linus Torvalds	ae9f4939ba	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixlets" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix error return code ftrace: Fix strncpy() use, use strlcpy() instead of strncpy() perf: Fix strncpy() use, use strlcpy() instead of strncpy() perf: Fix strncpy() use, always make sure it's NUL terminated perf: Fix ring_buffer perf_output_space() boundary calculation perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer()	2013-04-14 11:10:44 -07:00
Linus Torvalds	3c91930f0c	Merge tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull ftrace fixes from Steven Rostedt: "Namhyung Kim found and fixed a bug that can crash the kernel by simply doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid Luckily, this can only be done by root, but still is a nasty bug." * tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section tracing: Fix possible NULL pointer dereferences	2013-04-14 10:50:55 -07:00
Linus Torvalds	935d8aabd4	Add file_ns_capable() helper function for open-time capability checking Nothing is using it yet, but this will allow us to delay the open-time checks to use time, without breaking the normal UNIX permission semantics where permissions are determined by the opener (and the file descriptor can then be passed to a different process, or the process can drop capabilities). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-14 10:06:31 -07:00
Steven Rostedt (Red Hat)	7f49ef69db	ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the ftrace_pid_fops is defined when DYNAMIC_FTRACE is not. Cc: stable@vger.kernel.org Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-04-12 17:12:41 -04:00
Namhyung Kim	6a76f8c0ab	tracing: Fix possible NULL pointer dereferences Currently the set_ftrace_pid and set_graph_function files use seq_lseek for their fops. However, seq_open() is called only for FMODE_READ in the fops->open(), so if a user tries to seek in one of those files after opening it for writing, the kernel sees a NULL seq_file and panics. It can be easily reproduced with the following command: $ cd /sys/kernel/debug/tracing $ echo 1234 | sudo tee -a set_ftrace_pid In this example, GNU coreutils' tee opens the file with fopen(, "a") and then the fopen() internally calls lseek(). Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: stable@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-04-12 14:43:34 -04:00
Thomas Gleixner	f2530dc71c	kthread: Prevent unpark race which puts threads on the wrong cpu The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy: CPU0 CPU1 CPU2 unpark(T) wake_up_process(T) clear(SHOULD_PARK) T runs leave parkme() due to !SHOULD_PARK bind_to(CPU2) BUG_ON(wrong CPU) We cannot let the tasks move themselves to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup. Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, it's a good hint in ps and friends that these tasks aren't really there for the moment. The migration thread has another related issue. CPU0 CPU1 Bring up CPU2 create_thread(T) park(T) wait_for_completion() parkme() complete() sched_set_stop_task() schedule(TASK_PARKED) The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronization before sched_set_stop_task(). Reported-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Dave Hansen <dave@sr71.net> Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Acked-by: Peter Ziljstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: dhillf@gmail.com Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2013-04-12 14:18:43 +02:00
Wei Yongjun	c481420248	perf: Fix error return code Fix to return -ENOMEM in the allocation error case instead of 0 (if pmu_bus_running == 1), as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: a.p.zijlstra@chello.nl Cc: paulus@samba.org Cc: acme@ghostprotocols.net Link: http://lkml.kernel.org/r/CAPgLHd8j_fWcgqe%3DKLWjpBj%2B%3Do0Pw6Z-SEq%3DNTPU08c2w1tngQ@mail.gmail.com [ Tweaked the error code setting placement and the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-12 06:33:56 +02:00
Linus Torvalds	a3ab02b4c5	Merge tag 'pm-3.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: - System reboot/halt fix related to CPU offline ordering from Huacai Chen. - intel_pstate driver fix for a delay time computation error occasionally crashing systems using it from Dirk Brandewie. * tag 'pm-3.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / reboot: call syscore_shutdown() after disable_nonboot_cpus() cpufreq / intel_pstate: Set timer timeout correctly	2013-04-11 20:33:38 -07:00
Linus Torvalds	9baba6660b	Merge tag 'trace-fixes-3.9-rc-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Namhyung Kim fixed a long standing bug that can cause a kernel panic. If the function profiler fails to allocate memory for everything, it will do a double free on the same pointer which can cause a panic" * tag 'trace-fixes-3.9-rc-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix double free when function profile init failed	2013-04-10 15:56:57 -07:00
Sasha Levin	8184004ed7	locking/rtmutex/tester: Set correct permissions on sysfs files sysfs started complaining about cases where permissions don't match what's in the sysfs ops structure (such as allowing read without a "show" callback). Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: williams@redhat.com Link: http://lkml.kernel.org/r/1363795105-5884-1-git-send-email-sasha.levin@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 14:48:37 +02:00
Namhyung Kim	83e03b3fe4	tracing: Fix double free when function profile init failed On the failure path, stat->start and stat->pages will refer to the same page, so the code attempts to free the same page twice and causes a kernel panic. Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: stable@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-04-09 18:54:04 -04:00
Linus Torvalds	5f2f280f87	Merge tag 'trace-fixes-3.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This includes three fixes. Two fix features added in 3.9 and one fixes a long time minor bug. The first patch fixes a race that can happen if the user switches from the irqsoff tracer to another tracer. If an irqs-off latency is detected, it will try to use the snapshot buffer, but the new tracer won't have it allocated. There's a nasty warning that gets printed and the trace is ignored. Nothing crashes, just a nasty WARN_ON is shown. The second patch fixes an issue where if the sysctl is used to disable and enable function tracing, it can put the function tracing into an unstable state. The third patch fixes an issue with perf using the function tracer. An update was done, where the stub function could be called during the perf function tracing, and that stub function won't have the "control" flag set and cause a nasty warning when running perf." * tag 'trace-fixes-3.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Do not call stub functions in control loop ftrace: Consistently restore trace function on sysctl enabling tracing: Fix race with update_max_tr_single and changing tracers	2013-04-08 15:14:11 -07:00

15183 Commits