Commit Graph

17820 Commits

Author SHA1 Message Date
Srivatsa S. Bhat c270a81719 profile: Fix CPU hotplug callback registration
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();

Fix the profile code by using this latter form of callback registration.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-20 13:43:48 +01:00
Srivatsa S. Bhat d39ad278a3 trace, ring-buffer: Fix CPU hotplug callback registration
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();

Fix the tracing ring-buffer code by using this latter form of callback
registration.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-20 13:43:48 +01:00
Srivatsa S. Bhat 93ae4f978c CPU hotplug: Provide lockless versions of callback registration functions
The following method of CPU hotplug callback registration is not safe
due to the possibility of an ABBA deadlock involving the cpu_add_remove_lock
and the cpu_hotplug.lock.

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

The deadlock is shown below:

          CPU 0                                         CPU 1
          -----                                         -----

   Acquire cpu_hotplug.lock
   [via get_online_cpus()]

                                              CPU online/offline operation
                                              takes cpu_add_remove_lock
                                              [via cpu_maps_update_begin()]

   Try to acquire
   cpu_add_remove_lock
   [via register_cpu_notifier()]

                                              CPU online/offline operation
                                              tries to acquire cpu_hotplug.lock
                                              [via cpu_hotplug_begin()]

                            *** DEADLOCK! ***

The problem here is that callback registration takes the locks in one order
whereas the CPU hotplug operations take the same locks in the opposite order.
To avoid this issue and to provide a race-free method to register CPU hotplug
callbacks (along with initialization of already online CPUs), introduce new
variants of the callback registration APIs that simply register the callbacks
without holding the cpu_add_remove_lock during the registration. That way,
we can avoid the ABBA scenario. However, we will need to hold the
cpu_add_remove_lock throughout the entire critical section, to protect updates
to the callback/notifier chain.

This can be achieved by writing the callback registration code as follows:

	cpu_maps_update_begin(); [ or cpu_notifier_register_begin(); see below ]

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* This doesn't take the cpu_add_remove_lock */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_maps_update_done();  [ or cpu_notifier_register_done(); see below ]

Note that we can't use get_online_cpus() here instead of cpu_maps_update_begin()
because the cpu_hotplug.lock is dropped during the invocation of CPU_POST_DEAD
notifiers, and hence get_online_cpus() cannot provide the necessary
synchronization to protect the callback/notifier chains against concurrent
reads and writes. On the other hand, since the cpu_add_remove_lock protects
the entire hotplug operation (including CPU_POST_DEAD), we can use
cpu_maps_update_begin/done() to guarantee proper synchronization.

Also, since cpu_maps_update_begin/done() is like a super-set of
get/put_online_cpus(), the former naturally protects the critical sections
from concurrent hotplug operations.

Since the names cpu_maps_update_begin/done() don't make much sense in CPU
hotplug callback registration scenarios, we'll introduce new APIs named
cpu_notifier_register_begin/done() and map them to cpu_maps_update_begin/done().

In summary, introduce the lockless variants of un/register_cpu_notifier() and
also export the cpu_notifier_register_begin/done() APIs for use by modules.
This way, we provide a race-free way to register hotplug callbacks as well as
perform initialization for the CPUs that are already online.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-20 13:43:40 +01:00
Gautham R. Shenoy a19423b987 CPU hotplug: Add lockdep annotations to get/put_online_cpus()
Add lockdep annotations for get/put_online_cpus() and
cpu_hotplug_begin()/cpu_hotplug_end().

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-20 13:43:40 +01:00
Rafael J. Wysocki 36cc86e8ec Merge branches 'pm-runtime' and 'pm-sleep'
* pm-runtime:
  PM / Runtime: Update runtime_idle() documentation for return value meaning

* pm-sleep:
  PM / sleep: Correct whitespace errors in <linux/pm.h>
  PM: Add missing "freeze" state
  PM / Hibernate: Spelling s/anonymouns/anonymous/
  PM / Runtime: Add missing "it" in comment
  PM / suspend: Remove unnecessary !!
  PCI / PM: Resume runtime-suspended devices later during system suspend
  ACPI / PM: Resume runtime-suspended devices later during system suspend
  PM / sleep: Set pm_generic functions to NULL for !CONFIG_PM_SLEEP
  PM: fix typo in comment
  PM / hibernate: use name_to_dev_t to parse resume
  PM / wakeup: Include appropriate header file in kernel/power/wakelock.c
  PM / sleep: Move prototype declaration to header file kernel/power/power.h
  PM / sleep: Asynchronous threads for suspend_late
  PM / sleep: Asynchronous threads for suspend_noirq
  PM / sleep: Asynchronous threads for resume_early
  PM / sleep: Asynchronous threads for resume_noirq
  PM / sleep: Two flags for async suspend_noirq and suspend_late
2014-03-20 13:25:54 +01:00
Rafael J. Wysocki 165f5fd04a Merge branches 'pm-qos', 'pm-domains' and 'pm-drivers'
* pm-qos:
  PM / QoS: Add type to dev_pm_qos_add_ancestor_request() arguments
  ACPI / LPSS: Support for device latency tolerance PM QoS
  ACPI / scan: Add bind/unbind callbacks to struct acpi_scan_handler
  PM / QoS: Introcuce latency tolerance device PM QoS type
  PM / QoS: Add no_constraints_value field to struct pm_qos_constraints
  PM / QoS: Rename device resume latency QoS items

* pm-domains:
  PM / domains: Turn latency warning into debug message

* pm-drivers:
  PM: Add pm_runtime_suspend|resume_force functions
  PM / runtime: Fetch runtime PM callbacks using a macro
2014-03-20 13:25:36 +01:00
Viresh Kumar 6201b4d61f timer: Remove code redundancy while calling get_nohz_timer_target()
There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
call it under same circumstances, i.e.

	#ifdef CONFIG_NO_HZ_COMMON
	       if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
	               return get_nohz_timer_target();
	#endif

So, it makes more sense to get all this as part of get_nohz_timer_target()
instead of duplicating code at two places. For this another parameter is
required to be passed to this routine, pinned.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-20 12:35:46 +01:00
Viresh Kumar c41eba7de1 timer: Use variable head instead of &work_list in __run_timers()
We already have a variable 'head' that points to '&work_list', and so
we should use that instead wherever possible.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/0d8645a6efc8360c4196c9797d59343abbfdcc5e.1395129136.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-20 12:35:45 +01:00
Linus Torvalds 7c3895383f Merge branch 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "One really late cgroup patch to fix error path in create_css().
  Hitting this bug would be pretty rare but still possible and it gets
  delayed we'd need to backport it through -stable anyway.  It only
  updates error path in create_css() and has low chance of new
  breakages"

* 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix a failure path in create_css()
2014-03-19 16:13:40 -07:00
Tejun Heo 1b9aba49ea cgroup: fix cgroup_taskset walking order
cgroup_taskset is used to track and iterate target tasks while
migrating a task or process and should guarantee that the first task
iterated is the task group leader if a process is being migrated.

b3dc094e93 ("cgroup: use css_set->mg_tasks to track target tasks
during migration") replaced flex array cgroup_taskset->tc_array with
css_set->mg_tasks list to remove process size limit and dynamic
allocation during migration; unfortunately, it incorrectly used list
operations which don't preserve order breaking the guarantee that
cgroup_taskset_first() returns the leader for a process target.

Fix it by using order preserving list operations.  Note that as
multiple src_csets may map to a single dst_cset, the iteration order
may change across cgroup_task_migrate(); however, the leader is still
guaranteed to be the first entry.

The switch to list_splice_tail_init() at the end of cgroup_migrate()
isn't strictly necessary.  Let's still do it for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-19 17:43:21 -04:00
Bjorn Helgaas 6404e88e83 resources: Set type in __request_region()
We don't set the type (I/O, memory, etc.) of resources added by
__request_region(), which leads to confusing messages like this:

    address space collision: [io  0x1000-0x107f] conflicts with ACPI CPU throttle [??? 0x00001010-0x00001015 flags 0x80000000]

Set the type of a new resource added by __request_region() (used by
request_region() and request_mem_region()) to the type of its parent.  This
makes the resource tree internally consistent and fixes messages like the
above, where the ACPI CPU throttle resource really is an I/O port region,
but request_region() didn't fill in the type, so %pR didn't know how to
print it.

Sample dmesg showing the issue at the link below.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=71611
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-03-19 15:00:16 -06:00
Tejun Heo 8cbbf2c972 cgroup: implement CFTYPE_ONLY_ON_DFL
This cftype flag makes the file only appear on the default hierarchy.
This will later be used for cgroup.controllers file.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:55 -04:00
Tejun Heo a2dd424750 cgroup: make cgrp_dfl_root mountable
cgrp_dfl_root will be used as the default unified hierarchy.  This
patch makes cgrp_dfl_root mountable by making the following changes.

* cgroup_init_early() now initializes cgrp_dfl_root w/
  CGRP_ROOT_SANE_BEHAVIOR.  The default hierarchy is always sane.

* parse_cgroupfs_options() and cgroup_mount() are updated such that
  cgrp_dfl_root is mounted if sane_behavior is specified w/o any
  subsystems.

* rebind_subsystems() now populates the root directory of
  cgrp_dfl_root.  Note that the function still guarantees success of
  rebinding subsystems to cgrp_dfl_root.  If populating fails while
  rebinding to cgrp_dfl_root, it whines but ignores the error.

* For backward compatibility, the default hierarchy shows up in
  /proc/$PID/cgroup only after it's explicitly mounted so that
  userland which doesn't make use of it doesn't see any change.

* "current_css_set_cg_links" file of debug cgroup now treats the
  default hierarchy the same as other hierarchies.  This is visible to
  userland.  Given that it's for debug controller, this should be
  fine.

* While at it, implement cgroup_on_dfl() which tests whether a give
  cgroup is on the default hierarchy or not.

The above changes make cgrp_dfl_root mostly equivalent to other
controllers but the actual unified hierarchy behaviors are not
implemented yet.  Let's plug child cgroup creation in cgrp_dfl_root
from create_cgroup() for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:55 -04:00
Tejun Heo 4d3bb511b5 cgroup: drop const from @buffer of cftype->write_string()
cftype->write_string() just passes on the writeable buffer from kernfs
and there's no reason to add const restriction on the buffer.  The
only thing const achieves is unnecessarily complicating parsing of the
buffer.  Drop const from @buffer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>                                           
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-03-19 10:23:54 -04:00
Tejun Heo 3dd06ffa9d cgroup: rename cgroup_dummy_root and related names
The dummy root will be repurposed to serve as the default unified
hierarchy.  Let's rename things in preparation.

* s/cgroup_dummy_root/cgrp_dfl_root/
* s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore
* s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:54 -04:00
Tejun Heo 944196278d cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroupfs_root->subsys_mask represents the controllers attached to the
hierarchy.  This patch moves the field to cgroup.  Subsystem
initialization and rebinding updates the top cgroup's subsys_mask.
For !root cgroups, the subsys_mask bits are set from create_css() and
cleared from kill_css(), which effectively means that all cgroups will
have the same subsys_mask as the top cgroup.

While this doesn't make any difference now, this will help
implementation of the default unified hierarchy where !root cgroups
may have subsets of the top_cgroup's subsys_mask.

While at it, __kill_css() is split out of kill_css().  The former
doesn't care about the subsys_mask while the latter becomes noop if
the controller is already killed and clears the matching bit if not
before proceeding to killing the css.  This will be used later by the
default unified hierarchy implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:54 -04:00
Tejun Heo 5df3603229 cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
Currently, while rebinding, cgroup_dummy_root serves as the anchor
point.  In addition to the target root, rebind_subsystems() takes
@added_mask and @removed_mask.  The subsystems specified in the former
are expected to be on the dummy root and then moved to the target
root.  The ones in the latter are moved from non-dummy root to dummy.
Now that the dummy root is a fully functional one and we're planning
to use it for the default unified hierarchy, this level of distinction
between dummy and non-dummy roots is quite awkward.

This patch updates rebind_subsystems() to take the target root and one
subsystem mask and move the specified subsystmes to the target root
which may or may not be the dummy root.  IOW, unbinding now becomes
moving the subsystems to the dummy root and binding to non-dummy root.
This makes the dummy root mostly equivalent to other hierarchies in
terms of the mechanism of moving subsystems around; however, we still
retain all the semantical restrictions so that this patch doesn't
introduce any visible behavior differences.  Another noteworthy detail
is that rebind_subsystems() guarantees that moving a subsystem to the
dummy root never fails so that valid unmounting attempts always
succeed.

This unifies binding and unbinding of subsystems.  The invocation
points of ->bind() were inconsistent between the two and now moved
after whole rebinding is complete.  This doesn't break the current
users and generally makes more sense.

All rebind_subsystems() users are converted accordingly.  Note that
cgroup_remount() now makes two calls to rebind_subsystems() to bind
and then unbind the requested subsystems.

This will allow repurposing of the dummy hierarchy as the default
unified hierarchy and shouldn't make any userland visible behavior
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:54 -04:00
Tejun Heo 985ed67014 cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup_dummy_root is used to host controllers which aren't attached to
any other hierarchy.  The root is minimally set up during kernfs
bootstrap and didn't go through full hierarchy initialization.  We're
planning to use cgroup_dummy_root for the default unified hierarchy
and thus want it to be fully functional.

Replace the special initialization, which was collected into
cgroup_init() by the previous patch, with an invocation of
cgroup_setup_root().  This simplifies the init path and makes
cgroup_dummy_root a full hierarchy with its own kernfs_root and all.

As this puts the dummy hierarchy on the cgroup_roots list, rename
for_each_active_root() to for_each_root() and update its users to skip
the dummy root for now.

This patch doesn't cause any userland visible behavior changes at this
point.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:53 -04:00
Tejun Heo 172a2c0685 cgroup: reorganize cgroup bootstrapping
* Fields of init_css_set and css_set_count are now set using
  initializer instead of programmatically from cgroup_init_early().

* init_cgroup_root() now also takes @opts and performs the optional
  part of initialization too.  The leftover part of
  cgroup_root_from_opts() is collapsed into its only caller -
  cgroup_mount().

* Initialization of cgroup_root_count and linking of init_css_set are
  moved from cgroup_init_early() to to cgroup_init().  None of the
  early_init users depends on init_css_set being linked.

* Subsystem initializations are moved after dummy hierarchy init and
  init_css_set linking.

These changes reorganize the bootstrap logic so that the dummy
hierarchy can share the usual hierarchy init path and be made more
normal.  These changes don't make noticeable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:53 -04:00
Tejun Heo 5d77381fd8 cgroup: relocate setting of CGRP_DEAD
In cgroup_destroy_locked(), move setting of CGRP_DEAD above
invocations of kill_css().  This doesn't make any visible behavior
difference now but will be used to inhibit manipulating controller
enable states of a dying cgroup on the unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-03-19 10:23:53 -04:00
Chema Gonzalez bab5c790cc genirq: procfs: Make smp_affinity values go+r
Includes:
- /proc/irq/default_smp_affinity
- /proc/irq/*/affinity_hint
- /proc/irq/*/smp_affinity
- /proc/irq/*/smp_affinity_list

Users can distill the same information by reading /proc/interrupts.

Signed-off-by: Chema Gonzalez <chema@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: http://lkml.kernel.org/r/1394765455-1217-1-git-send-email-chema@google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-19 12:34:29 +01:00
Thomas Gleixner d532676cc7 softirq: Add linux/irq.h to make it compile again
On Sparc and S390 the removal of irq.h from kernel_stat.h causes:

   kernel/softirq.c:774:9: error: 'NR_IRQS_LEGACY' undeclared

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-19 11:28:14 +01:00
Li Zefan 3eb59ec64f cgroup: fix a failure path in create_css()
If online_css() fails, we should remove cgroup files belonging
to css->ss.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-18 17:15:36 -04:00
David A. Long 6fe50a28ba uprobes: allow ignoring of probe hits
Allow arches to decided to ignore a probe hit.  ARM will use this to
only call handlers if the conditions to execute a conditionally executed
instruction are satisfied.

Signed-off-by: David A. Long <dave.long@linaro.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
2014-03-18 16:39:34 -04:00
David A. Long 09294e31b1 uprobes: Kconfig dependency fix
Suggested change from Oleg Nesterov. Fixes incomplete dependencies
for uprobes feature.

Signed-off-by: David A. Long <dave.long@linaro.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
2014-03-18 16:39:33 -04:00
Linus Torvalds 59bf6c3c6c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three small fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock: Prevent tracing recursion in sched_clock_cpu()
  stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus()
  sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policy
2014-03-16 10:42:07 -07:00
Thomas Gleixner 328a4978df genirq: Add a new IRQCHIP_EOI_THREADED flag
The flag is necessary for interrupt chips which require an ACK/EOI
after the handler has run. In case of threaded handlers this needs to
happen after the threaded handler has completed before the unmask of
the interrupt.

The flag is only unseful in combination with the handle_fasteoi_irq
flow control handler.

It can be combined with the flag IRQCHIP_EOI_IF_HANDLED, so the EOI is
not issued when the interrupt is disabled or in progress.

Tested-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-sunxi@googlegroups.com
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Link: http://lkml.kernel.org/r/1394733834-26839-2-git-send-email-hdegoede@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-14 13:43:33 +01:00
Jens Axboe 89f8b33ca1 block: remove old blk_iopoll_enabled variable
This was a debugging measure to toggle enabled/disabled
when testing. But for real production setups, it's not
safe to toggle this setting without either reloading
drivers of quiescing IO first. Neither of which the toggle
enforces.

Additionally, it makes drivers deal with the conditional
state.

Remove it completely. It's up to the driver whether iopoll
is enabled or not.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-13 09:38:42 -06:00
Frederic Weisbecker 300a9d887e sched: Remove needless round trip nsecs <-> tick conversion of steal time
When update_rq_clock_task() accounts the pending steal time for a task,
it converts the steal delta from nsecs to tick then from tick to nsecs.

There is no apparent good reason for doing that though because both
the task clock and the prev steal delta are u64 and store values
in nsecs.

So lets remove the needless conversion.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-03-13 15:56:44 +01:00
Frederic Weisbecker dee08a72de cputime: Fix jiffies based cputime assumption on steal accounting
The steal guest time accounting code assumes that cputime_t is based on
jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
is based on nsecs, steal_account_process_tick() passes the delta in
jiffies to account_steal_time() which then accounts it as if it's a
value in nsecs.

As a result, accounting 1 second of steal time (with HZ=100 that would
be 100 jiffies) is spuriously accounted as 100 nsecs.

As such /proc/stat may report 0 values of steal time even when two
guests have run concurrently for a few seconds on the same host and
same CPU.

In order to fix this, lets convert the nsecs based steal delta to
cputime instead of jiffies by using the right conversion API.

Given that the steal time is stored in cputime_t and this type can have
a smaller granularity than nsecs, we only account the rounded converted
value and leave the remaining nsecs for the next deltas.

Reported-by: Huiqingding <huding@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-03-13 15:56:44 +01:00
Mathieu Desnoyers 66cc69e34e Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
Users have reported being unable to trace non-signed modules loaded
within a kernel supporting module signature.

This is caused by tracepoint.c:tracepoint_module_coming() refusing to
take into account tracepoints sitting within force-loaded modules
(TAINT_FORCED_MODULE). The reason for this check, in the first place, is
that a force-loaded module may have a struct module incompatible with
the layout expected by the kernel, and can thus cause a kernel crash
upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.

Tracepoints, however, specifically accept TAINT_OOT_MODULE and
TAINT_CRAP, since those modules do not lead to the "very likely system
crash" issue cited above for force-loaded modules.

With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
module is tainted re-using the TAINT_FORCED_MODULE taint flag.
Unfortunately, this means that Tracepoints treat that module as a
force-loaded module, and thus silently refuse to consider any tracepoint
within this module.

Since an unsigned module does not fit within the "very likely system
crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
to specifically address this taint behavior, and accept those modules
within Tracepoints. We use the letter 'X' as a taint flag character for
a module being loaded that doesn't know how to sign its name (proposed
by Steven Rostedt).

Also add the missing 'O' entry to trace event show_module_flags() list
for the sake of completeness.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
NAKed-by: Ingo Molnar <mingo@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-03-13 12:11:51 +10:30
Jiri Slaby 27bba4d6bb module: use pr_cont
When dumping loaded modules, we print them one by one in separate
printks. Let's use pr_cont as they are continuation prints.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-03-13 12:10:59 +10:30
Thomas Gleixner ffb12cf002 Merge branch 'irq/for-gpio' into irq/core
Merge the request/release callbacks which are in a separate branch for
consumption by the gpio folks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-12 16:01:07 +01:00
Thomas Gleixner c1bacbae81 genirq: Provide irq_request/release_resources chip callbacks
For certain irq types, e.g. gpios, it's necessary to request resources
before starting up the irq.

This might fail so we cannot use the irq_startup() callback because we
might call the irq_set_type() callback before that which does not make
sense when the resource is not available. Calling irq_startup() before
irq_set_type() can lead to spurious interrupts which is not desired
either.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jean-Jacques Hiblot <jjhiblot@traphandler.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org 
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1403080857160.18573@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-12 16:00:24 +01:00
Peter Zijlstra 6f008e72cd locking/mutex: Fix debug checks
OK, so commit:

  1d8fe7dc80 ("locking/mutexes: Unlock the mutex without the wait_lock")

generates this boot warning when CONFIG_DEBUG_MUTEXES=y:

  WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.c:82 debug_mutex_unlock+0x155/0x180() DEBUG_LOCKS_WARN_ON(lock->owner != current)

And that makes sense, because as soon as we release the lock a
new owner can come in...

One would think that !__mutex_slowpath_needs_to_unlock()
implementations suffer the same, but for DEBUG we fall back to
mutex-null.h which has an unconditional 1 for that.

The mutex debug code requires the mutex to be unlocked after
doing the debug checks, otherwise it can find inconsistent
state.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jason.low2@hp.com
Link: http://lkml.kernel.org/r/20140312122442.GB27965@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-12 13:49:47 +01:00
Alex Shi 6037dd1a49 sched: Clean up the task_hot() function
task_hot() doesn't need the 'sched_domain' parameter, so remove it.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394607111-1904-1-git-send-email-alex.shi@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-12 10:49:01 +01:00
Vincent Guittot a2cd42601b sched: Remove double calculation in fix_small_imbalance()
The tmp value has been already calculated in:

  scaled_busy_load_per_task =
		(busiest->load_per_task * SCHED_POWER_SCALE) /
		busiest->group_power;

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394555166-22894-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-12 10:49:00 +01:00
Steven Rostedt 383afd0971 sched: Fix broken setscheduler()
I decided to run my tests on linux-next, and my wakeup_rt tracer was
broken. After running a bisect, I found that the problem commit was:

   linux-next commit c365c292d0
   "sched: Consider pi boosting in setscheduler()"

And the reason the wake_rt tracer test was failing, was because it had
no RT task to trace. I first noticed this when running with
sched_switch event and saw that my RT task still had normal SCHED_OTHER
priority. Looking at the problem commit, I found:

 -       p->normal_prio = normal_prio(p);
 -       p->prio = rt_mutex_getprio(p);

With no

 +       p->normal_prio = normal_prio(p);
 +       p->prio = rt_mutex_getprio(p);

Reading what the commit is suppose to do, I realize that the p->prio
can't be set if the task is boosted with a higher prio, but the
p->normal_prio still needs to be set regardless, otherwise, when the
task is deboosted, it wont get the new priority.

The p->prio has to be set before "check_class_changed()" is called,
otherwise the class wont be changed.

Also added fix to newprio to include a check for deadline policy that
was missing. This change was suggested by Juri Lelli.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: SebastianAndrzej Siewior <bigeasy@linutronix.de>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-12 10:48:59 +01:00
Sasha Levin d88471cb8b ftrace: Constify ftrace_text_reserved
Link: http://lkml.kernel.org/r/1357772960-4436-5-git-send-email-sasha.levin@oracle.com

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-11 22:52:43 -04:00
Mathieu Desnoyers 3bbc8db341 tracepoints: API doc update to tracepoint_probe_register() return value
Describe the return values of tracepoint_probe_register(), including
-ENODEV added by commit:

Author: Steven Rostedt <rostedt@goodmis.org>

    tracing: Warn if a tracepoint is not set via debugfs

Link: http://lkml.kernel.org/r/1394499898-1537-2-git-send-email-mathieu.desnoyers@efficios.com

CC: Ingo Molnar <mingo@kernel.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-11 21:53:50 -04:00
Mathieu Desnoyers 4c11628a16 tracepoints: API doc update to data argument
Describe the @data argument (probe private data).

Link: http://lkml.kernel.org/r/1394587948-27878-1-git-send-email-mathieu.desnoyers@efficios.com

Fixes: 38516ab59f "tracing: Let tracepoints have data passed to tracepoint callbacks"
CC: Ingo Molnar <mingo@kernel.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-11 21:51:40 -04:00
Geert Uytterhoeven af02b5fdb1 PM: Add missing "freeze" state
Fix descriptions of /sys/power/state in the documentation and in
a code comment.

Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
[rjw: Changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-12 00:54:53 +01:00
Geert Uytterhoeven 4d4348202b PM / Hibernate: Spelling s/anonymouns/anonymous/
Spelling fix.

Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-12 00:54:53 +01:00
Jiri Slaby db0fbadcbd ftrace: Fix compilation warning about control_ops_free
With CONFIG_DYNAMIC_FTRACE=n, I see a warning:
kernel/trace/ftrace.c:240:13: warning: 'control_ops_free' defined but not used
 static void control_ops_free(struct ftrace_ops *ops)
             ^
Move that function around to an already existing #ifdef
CONFIG_DYNAMIC_FTRACE block as the function is used solely from the
dynamic function tracing functions.

Link: http://lkml.kernel.org/r/1394484131-5107-1-git-send-email-jslaby@suse.cz

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-11 19:38:20 -04:00
Linus Torvalds adf961d7e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull audit namespace fixes from Eric Biederman:
 "Starting with 3.14-rc1 the audit code is faulty (think oopses and
  races) with respect to how it computes the network namespace of which
  socket to reply to, and I happened to notice by chance when reading
  through the code.

  My testing and the automated build bots don't find any problems with
  these fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  audit: Update kdoc for audit_send_reply and audit_list_rules_send
  audit: Send replies in the proper network namespace.
  audit: Use struct net not pid_t to remember the network namespce to reply in
2014-03-11 10:17:50 -07:00
Peter Zijlstra 34c6bc2c91 locking/mutexes: Add extra reschedule point
Add in an extra reschedule in an attempt to avoid getting reschedule
the moment we've acquired the lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-zah5eyn9gu7qlgwh9r6n2anc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:14:59 +01:00
Peter Zijlstra fb0527bd5e locking/mutexes: Introduce cancelable MCS lock for adaptive spinning
Since we want a task waiting for a mutex_lock() to go to sleep and
reschedule on need_resched() we must be able to abort the
mcs_spin_lock() around the adaptive spin.

Therefore implement a cancelable mcs lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: chegu_vinod@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Cc: Jason Low <jason.low2@hp.com>
Link: http://lkml.kernel.org/n/tip-62hcl5wxydmjzd182zhvk89m@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:14:56 +01:00
Jason Low 1d8fe7dc80 locking/mutexes: Unlock the mutex without the wait_lock
When running workloads that have high contention in mutexes on an 8 socket
machine, mutex spinners would often spin for a long time with no lock owner.

The main reason why this is occuring is in __mutex_unlock_common_slowpath(),
if __mutex_slowpath_needs_to_unlock(), then the owner needs to acquire the
mutex->wait_lock before releasing the mutex (setting lock->count to 1). When
the wait_lock is contended, this delays the mutex from being released.
We should be able to release the mutex without holding the wait_lock.

Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: chegu_vinod@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390936396-3962-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:14:54 +01:00
Jason Low 47667fa150 locking/mutexes: Modify the way optimistic spinners are queued
The mutex->spin_mlock was introduced in order to ensure that only 1 thread
spins for lock acquisition at a time to reduce cache line contention. When
lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
continually release and obtain the lock->spin_mlock. This can generate
quite a bit of overhead/contention, and also might just delay the spinner
from getting the lock.

This patch modifies the way optimistic spinners are queued by queuing before
entering the optimistic spinning loop as oppose to acquiring before every
call to mutex_spin_on_owner(). So in situations where the spinner requires
a few extra spins before obtaining the lock, then there will only be 1 spinner
trying to get the lock and it will avoid the overhead from unnecessarily
unlocking and locking the spin_mlock.

Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Cc: chegu_vinod@hp.com
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390936396-3962-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:14:53 +01:00
Jason Low 46af29e479 locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner()
The mutex_can_spin_on_owner() function should also return false if the
task needs to be rescheduled to avoid entering the MCS queue when it
needs to reschedule.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Cc: chegu_vinod@hp.com
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1390936396-3962-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:14:52 +01:00
Peter Zijlstra c9122da1e2 locking: Move mcs_spinlock.h into kernel/locking/
The mcs_spinlock code is not meant (or suitable) as a generic locking
primitive, therefore take it away from the normal includes and place
it in kernel/locking/.

This way the locking primitives implemented there can use it as part
of their implementation but we do not risk it getting used
inapropriately.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-byirmpamgr7h25m5kyavwpzx@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:14:52 +01:00
Mike Galbraith 156654f491 sched/numa: Move task_numa_free() to __put_task_struct()
Bad idea on -rt:

[  908.026136]  [<ffffffff8150ad6a>] rt_spin_lock_slowlock+0xaa/0x2c0
[  908.026145]  [<ffffffff8108f701>] task_numa_free+0x31/0x130
[  908.026151]  [<ffffffff8108121e>] finish_task_switch+0xce/0x100
[  908.026156]  [<ffffffff81509c0a>] thread_return+0x48/0x4ae
[  908.026160]  [<ffffffff8150a095>] schedule+0x25/0xa0
[  908.026163]  [<ffffffff8150ad95>] rt_spin_lock_slowlock+0xd5/0x2c0
[  908.026170]  [<ffffffff810658cf>] get_signal_to_deliver+0xaf/0x680
[  908.026175]  [<ffffffff8100242d>] do_signal+0x3d/0x5b0
[  908.026179]  [<ffffffff81002a30>] do_notify_resume+0x90/0xe0
[  908.026186]  [<ffffffff81513176>] int_signal+0x12/0x17
[  908.026193]  [<00007ff2a388b1d0>] 0x7ff2a388b1cf

and since upstream does not mind where we do this, be a bit nicer ...

Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:05:43 +01:00
Kirill Tkhai 35805ff8f4 sched/fair: Fix endless loop in idle_balance()
Check for fair tasks number to decide, that we've pulled a task.
rq's nr_running may contain throttled RT tasks.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394118975.19290.104.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:05:41 +01:00
Kirill Tkhai 4c6c4e38c4 sched/core: Fix endless loop in pick_next_task()
1) Single cpu machine case.

When rq has only RT tasks, but no one of them can be picked
because of throttling, we enter in endless loop.

pick_next_task_{dl,rt} return NULL.

In pick_next_task_fair() we permanently go to retry

	if (rq->nr_running != rq->cfs.h_nr_running)
		return RETRY_TASK;

(rq->nr_running is not being decremented when rt_rq becomes
throttled).

No chances to unthrottle any rt_rq or to wake fair here,
because of rq is locked permanently and interrupts are
disabled.

2) In case of SMP this can cause a hang too. Although we unlock
   rq in idle_balance(), interrupts are still disabled.

The solution is to check for available tasks in DL and RT
classes instead of checking for sum.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394098321.19290.11.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:05:39 +01:00
Kirill Tkhai e4aa358b6c sched/fair: Push down check for high priority class task into idle_balance()
We close idle_exit_fair() bracket in case of we've pulled something or we've received
task of high priority class.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1394098315.19290.10.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:05:37 +01:00
Kirill Tkhai 734ff2a71f sched/rt: Fix picking RT and DL tasks from empty queue
The problems:

1) We check for rt_nr_running before call of put_prev_task().
   If previous task is RT, its rt_rq may become throttled
   and dequeued after this call.

In case of p is from rt->rq this just causes picking a task
from throttled queue, but in case of its rt_rq is child
we are guaranteed catch BUG_ON.

2) The same with deadline class. The only difference we operate
   on only dl_rq.

This patch fixes all the above problems and it adds a small skip in the
DL update like we've already done for RT class:

	if (unlikely((s64)delta_exec <= 0))
		return;

This will optimize sequential update_curr_dl() calls a little.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:05:35 +01:00
Jiri Olsa 63c45f4ba5 perf: Disallow user-space stack dumps for function trace events
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.

The user space stack dump is just another source of the this issue.

Related list discussions:
  http://marc.info/?t=139302086500001&r=1&w=2
  http://marc.info/?t=139301437300003&r=1&w=2

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:57:58 +01:00
Jiri Olsa cfa77bc4af perf: Disallow user-space callchains for function trace events
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.

Related list discussions:

  http://marc.info/?t=139302086500001&r=1&w=2
  http://marc.info/?t=139301437300003&r=1&w=2

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:57:57 +01:00
Ingo Molnar 0066f3b93e Merge branch 'perf/urgent' into perf/core
Merge the latest fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:53:50 +01:00
Daniel Lezcano a1d028bd6d sched/idle: Add more comments to the code
The idle main function is a complex and a critical function. Added more
comments to the code.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-5-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:52:49 +01:00
Daniel Lezcano 8ca3c6424f sched/idle: Move idle conditions in cpuidle_idle main function
This patch moves the condition before entering idle into the cpuidle main
function located in idle.c. That simplify the idle mainloop functions and
increase the readibility of the conditions to enter truly idle.

This patch is code reorganization and does not change the behavior of the
function.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-4-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:52:48 +01:00
Daniel Lezcano c8cc7d4de7 sched/idle: Reorganize the idle loop
Now that we have the main cpuidle function in idle.c, move some code from
the idle mainloop to this function for the sake of clarity.

That removes if then else indentation difficult to follow when looking at the
code. This patch does not change the current behavior.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-3-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:52:47 +01:00
Daniel Lezcano 30cdd69e2a cpuidle/idle: Move the cpuidle_idle_call function to idle.c
The cpuidle_idle_call does nothing more than calling the three individuals
function and is no longer used by any arch specific code but only in the
cpuidle framework code.

We can move this function into the idle task code to ensure better
proximity to the scheduler code.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:52:45 +01:00
Ingo Molnar a02ed5e3e0 Merge branch 'sched/urgent' into sched/core
Pick up fixes before queueing up new changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:34:27 +01:00
Fernando Luis Vazquez Cao 96b3d28bf4 sched/clock: Prevent tracing recursion in sched_clock_cpu()
Prevent tracing of preempt_disable/enable() in sched_clock_cpu().
When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are
traced and this causes trace_clock() users (and probably others) to
go into an infinite recursion. Systems with a stable sched_clock()
are not affected.

This problem is similar to that fixed by upstream commit 95ef1e5292
("KVM guest: prevent tracing recursion with kvmclock").

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:33:48 +01:00
Peter Zijlstra 177c53d943 stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus()
We must use smp_call_function_single(.wait=1) for the
irq_cpu_stop_queue_work() to ensure the queueing is actually done under
stop_cpus_lock. Without this we could have dropped the lock by the time
we do the queueing and get the race we tried to fix.

Fixes: 7053ea1a34 ("stop_machine: Fix race between stop_two_cpus() and stop_cpus()")

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140228123905.GK3104@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:33:47 +01:00
Juri Lelli d44753b843 sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policy
Deny the use of SCHED_DEADLINE policy to unprivileged users.
Even if root users can set the policy for normal users, we
don't want the latter to be able to change their parameters
(safest behavior).

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:33:46 +01:00
Viresh Kumar 7500d9363f PM / suspend: Remove unnecessary !!
Double ! or !! are normally required to get 0 or 1 out of a expression. A
comparision always returns 0 or 1 and hence there is no need to apply double !
over it again.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-11 01:45:21 +01:00
Johannes Weiner e97ca8e5b8 mm: fix GFP_THISNODE callers and clarify
GFP_THISNODE is for callers that implement their own clever fallback to
remote nodes.  It restricts the allocation to the specified node and
does not invoke reclaim, assuming that the caller will take care of it
when the fallback fails, e.g.  through a subsequent allocation request
without GFP_THISNODE set.

However, many current GFP_THISNODE users only want the node exclusive
aspect of the flag, without actually implementing their own fallback or
triggering reclaim if necessary.  This results in things like page
migration failing prematurely even when there is easily reclaimable
memory available, unless kswapd happens to be running already or a
concurrent allocation attempt triggers the necessary reclaim.

Convert all callsites that don't implement their own fallback strategy
to __GFP_THISNODE.  This restricts the allocation a single node too, but
at the same time allows the allocator to enter the slowpath, wake
kswapd, and invoke direct reclaim if necessary, to make the allocation
happen when memory is full.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:19 -07:00
Thomas Gleixner f9a8a0abc3 Merge branch 'fortglx/3.15/time' of git://git.linaro.org/people/john.stultz/linux into timers/core
- support CLOCK_BOOTTIME clock in timerfd
 - Add missing header file

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-10 19:53:09 +01:00
Eric W. Biederman d211f177b2 audit: Update kdoc for audit_send_reply and audit_list_rules_send
The kbuild test robot reported:
> tree:   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next
> head:   6f285b19d0
> commit: 6f285b19d0 [2/2] audit: Send replies in the proper network namespace.
> reproduce: make htmldocs
>
> >> Warning(kernel/audit.c:575): No description found for parameter 'request_skb'
> >> Warning(kernel/audit.c:575): Excess function parameter 'portid' description in 'audit_send_reply'
> >> Warning(kernel/auditfilter.c:1074): No description found for parameter 'request_skb'
> >> Warning(kernel/auditfilter.c:1074): Excess function parameter 'portid' description in 'audit_list_rules_s

Which was caused by my failure to update the kdoc annotations when I
updated the functions.  Fix that small oversight now.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-03-08 15:31:54 -08:00
Linus Torvalds ca62eec4e5 Merge branch 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Two cpuset locking fixes from Li.  Both tagged for -stable"

* 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: fix a race condition in __cpuset_node_allowed_softwall()
  cpuset: fix a locking issue in cpuset_migrate_mm()
2014-03-08 11:57:38 -08:00
Linus Torvalds 721f0c1260 In the past, I've had lots of reports about trace events not working.
Developers would say they put a trace_printk() before and after the trace
 event but when they enable it (and the trace event said it was enabled) they
 would see the trace_printks but not the trace event.
 
 I was not able to reproduce this, but that's because I wasn't looking at
 the right location. Recently, another bug came up that showed the issue.
 
 If your kernel supports signed modules but allows for non-signed modules
 to be loaded, then when one is, the kernel will silently set the
 MODULE_FORCED taint on the module. Although, this taint happens without
 the need for insmod --force or anything of the kind, it labels the
 module with that taint anyway.
 
 If this tainted module has tracepoints, the tracepoints will be ignored
 because of the MODULE_FORCED taint. But no error message will be
 displayed. Worse yet, the event infrastructure will still be created
 letting users enable the trace event represented by the tracepoint,
 although that event will never actually be enabled. This is because
 the tracepoint infrastructure allows for non-existing tracepoints to
 be enabled for new modules to arrive and have their tracepoints set.
 
 Although there are several things wrong with the above, this change
 only addresses the creation of the trace event files for tracepoints
 that are not created when a module is loaded and is tainted. This change
 will print an error message about the module being tainted and not the
 trace events will not be created, and it does not create the trace event
 infrastructure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTFnMPAAoJEKQekfcNnQGuPPwH/Rtwy/siM+ltvlLnEbRjS4RL
 9aF5mfJUazmfCaOBMSaMUo92uCbciIVif6icX843JmCdCOR5Hk5SZryBbt2A/dF9
 TcMloKNbIn/ad7yZ0O75BJlPnRJ5RZ42edQfW1lkdeWo644C8Kj399fVPt7KU5SH
 1KTWyShT05E2fYjp2lMrb+FOFfKerlzkXtgGwJKXnd/7hrbdmKEH/OO8YkMrlVZp
 SURPyzNMMVKoUFY797b6FrFRqV04C210BtNcNrd4S3/V9VE4IPS/8YSLfvVaGkD0
 e2kVAvIOkwPnYzMZg70jf2R8NlGS2mwaVC+NenBHz3KlpFdaeRu1hFw7/n8h2/s=
 =YbJd
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "In the past, I've had lots of reports about trace events not working.
  Developers would say they put a trace_printk() before and after the
  trace event but when they enable it (and the trace event said it was
  enabled) they would see the trace_printks but not the trace event.

  I was not able to reproduce this, but that's because I wasn't looking
  at the right location.  Recently, another bug came up that showed the
  issue.

  If your kernel supports signed modules but allows for non-signed
  modules to be loaded, then when one is, the kernel will silently set
  the MODULE_FORCED taint on the module.  Although, this taint happens
  without the need for insmod --force or anything of the kind, it labels
  the module with that taint anyway.

  If this tainted module has tracepoints, the tracepoints will be
  ignored because of the MODULE_FORCED taint.  But no error message will
  be displayed.  Worse yet, the event infrastructure will still be
  created letting users enable the trace event represented by the
  tracepoint, although that event will never actually be enabled.  This
  is because the tracepoint infrastructure allows for non-existing
  tracepoints to be enabled for new modules to arrive and have their
  tracepoints set.

  Although there are several things wrong with the above, this change
  only addresses the creation of the trace event files for tracepoints
  that are not created when a module is loaded and is tainted.  This
  change will print an error message about the module being tainted and
  not the trace events will not be created, and it does not create the
  trace event infrastructure"

* tag 'trace-fixes-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Do not add event files for modules that fail tracepoints
2014-03-07 16:32:40 -08:00
Linus Torvalds 27ea0f7811 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 - a bugfix for a long standing waitqueue race
 - a trivial fix for a missing include

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Include missing header file in irqdomain.c
  genirq: Remove racy waitqueue_active check
2014-03-07 16:31:41 -08:00
Richard Guy Briggs f952d10ff4 audit: Use more current logging style again
Add pr_fmt to prefix "audit: " to output
Convert printk(KERN_<LEVEL> to pr_<level>
Coalesce formats

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2014-03-07 11:48:15 -05:00
Eric Paris b7d3622a39 Linux 3.13
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS3IyXAAoJEHm+PkMAQRiGplAH/ilCikBrCHyZ2938NHNLm+j1
 yhfYnEJHLNg7T69KEj3p0cNagO3v9RPWM6UYFBQ6uFIYNN1MBKO7U+mCZuMWzeO8
 +tGMV3mn5wx+oYn1RnWCCweQx5AESEl6rYn8udPDKh7LfW5fCLV60jguUjVSQ9IQ
 cvtKlWknbiHyM7t1GoYgzN7jlPrRQvcNQZ+Aogzz7uSnJAgwINglBAHS7WP2tiEM
 HAU2FoE4b3MbfGaid1vypaYQPBbFebx7Bw2WxAuZhkBbRiUBKlgF0/SYhOTvH38a
 Sjpj1EHKfjcuCBt9tht6KP6H56R25vNloGR2+FB+fuQBdujd/SPa9xDflEMcdG4=
 =iXnG
 -----END PGP SIGNATURE-----

Merge tag 'v3.13' into for-3.15

Linux 3.13

Conflicts:
	include/net/xfrm.h

Simple merge where v3.13 removed 'extern' from definitions and the audit
tree did s/u32/unsigned int/ to the same definitions.
2014-03-07 11:41:32 -05:00
Petr Mladek cd21067f69 ftrace: Warn on error when modifying ftrace function
We should print some warning and kill ftrace functionality when the ftrace
function is not set correctly. Otherwise, ftrace might do crazy things without
an explanation. The error value has been ignored so far.

Note that an error that happens during updating all the traced calls is handled
in ftrace_replace_code(). We print more details about the particular
failing address via ftrace_bug() there.

Link: http://lkml.kernel.org/r/1393258342-29978-3-git-send-email-pmladek@suse.cz

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:15 -05:00
Jiri Slaby 3a36cb11ca ftrace: Do not pass data to ftrace_dyn_arch_init
As the data parameter is not really used by any ftrace_dyn_arch_init,
remove that from ftrace_dyn_arch_init. This also removes the addr
local variable from ftrace_init which is now unused.

Note the documentation was imprecise as it did not suggest to set
(*data) to 0.

Link: http://lkml.kernel.org/r/1393268401-24379-4-git-send-email-jslaby@suse.cz

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:14 -05:00
Jiri Slaby af64a7cb09 ftrace: Pass retval through return in ftrace_dyn_arch_init()
No architecture uses the "data" parameter in ftrace_dyn_arch_init() in any
way, it just sets the value to 0. And this is used as a return value
in the caller -- ftrace_init, which just checks the retval against
zero.

Note there is also "return 0" in every ftrace_dyn_arch_init.  So it is
enough to check the retval and remove all the indirect sets of data on
all archs.

Link: http://lkml.kernel.org/r/1393268401-24379-3-git-send-email-jslaby@suse.cz

Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:13 -05:00
Jiri Slaby c867ccd838 ftrace: Inline the code from ftrace_dyn_table_alloc()
The function used to do allocations some time ago. This no longer
happens and it only checks the count and prints some info. This patch
inlines the body to the only caller. There are two reasons:
* the name of the function was misleading
* it's clear what is going on in ftrace_init now

Link: http://lkml.kernel.org/r/1393268401-24379-2-git-send-email-jslaby@suse.cz

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:12 -05:00
Jiri Slaby 1dc43cf0be ftrace: Cleanup of global variables ftrace_new_pgs and ftrace_update_cnt
Some of them can be local to functions, so make them local and pass
them as parameters where needed:
* __start_mcount_loc+__stop_mcount_loc are local to ftrace_init
* ftrace_new_pgs -> new_pgs/start_pg
* ftrace_update_cnt -> local update_cnt in ftrace_update_code

Link: http://lkml.kernel.org/r/1393268401-24379-1-git-send-email-jslaby@suse.cz

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:12 -05:00
Steven Rostedt b196e2b9e2 tracing: Warn if a tracepoint is not set via debugfs
Tracepoints were made to allow enabling a tracepoint in a module before that
module was loaded. When a tracepoint is enabled and it does not exist, the
name is stored and will be enabled when the tracepoint is created.

The problem with this approach is that when a tracepoint is enabled when
it expects to be there, it gives no warning that it does not exist.

To add salt to the wound, if a module is added and sets the FORCED flag, which
can happen if it isn't signed properly, the tracepoint code will not enabled
the tracepoints, but they will be created in the debugfs system! When a user
goes to enable the tracepoint, the tracepoint code will not see it existing
and will think it is to be enabled later AND WILL NOT GIVE A WARNING.

The tracing will look like it succeeded but will actually be doing nothing.
This will cause lots of confusion and headaches for developers trying to
figure out why they are not seeing their tracepoints.

Link: http://lkml.kernel.org/r/20140213154507.4040fb06@gandalf.local.home

Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:07 -05:00
Steven Rostedt 3fd40d1ee6 tracing: Use helper functions in event assignment to shrink macro size
The functions that assign the contents for the ftrace events are
defined by the TRACE_EVENT() macros. Each event has its own unique
way to assign data to its buffer. When you have over 500 events,
that means there's 500 functions assigning data uniquely for each
event (not really that many, as DECLARE_EVENT_CLASS() and multiple
DEFINE_EVENT()s will only need a single function).

By making helper functions in the core kernel to do some of the work
instead, we can shrink the size of the kernel down a bit.

With a kernel configured with 502 events, the change in size was:

   text    data     bss     dec     hex filename
12987390        1913504 9785344 24686238        178ae9e /tmp/vmlinux
12959102        1913504 9785344 24657950        178401e /tmp/vmlinux.patched

That's a total of 28288 bytes, which comes down to 56 bytes per event.

Link: http://lkml.kernel.org/r/20120810034708.370808175@goodmis.org

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:07 -05:00
Steven Rostedt 35bb4399bd tracing: Move event storage for array from macro to standalone function
The code that shows array fields for events is defined for all events.
This can add up quite a bit when you have over 500 events.

By making helper functions in the core kernel to do the work
instead, we can shrink the size of the kernel down a bit.

With a kernel configured with 502 events, the change in size was:

   text    data     bss     dec     hex filename
12990946        1913568 9785344 24689858        178bcc2 /tmp/vmlinux
12987390        1913504 9785344 24686238        178ae9e /tmp/vmlinux.patched

That's a total of 3556 bytes, which comes down to 7 bytes per event.
Although it's not much, this code is just called at initialization of
the events.

Link: http://lkml.kernel.org/r/20120810034708.084036335@goodmis.org

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:06 -05:00
Steven Rostedt 1d6bae966e tracing: Move raw output code from macro to standalone function
The code for trace events to format the raw recorded event data
into human readable format in the 'trace' file is repeated for every
event in the system. When you have over 500 events, this can add up
quite a bit.

By making helper functions in the core kernel to do the work
instead, we can shrink the size of the kernel down a bit.

With a kernel configured with 502 events, the change in size was:

   text    data     bss     dec     hex filename
12991007        1913568 9785344 24689919        178bcff /tmp/vmlinux.orig
12990946        1913568 9785344 24689858        178bcc2 /tmp/vmlinux.patched

Note, this version does not save as much as the version of this patch
I had a few years ago. That is because in the mean time, commit
f71130de5c ("tracing: Add a helper function for event print functions")
did a lot of the work my original patch did. But this change helps
slightly, and is part of a larger clean up to reduce the size much further.

Link: http://lkml.kernel.org/r/20120810034707.378538034@goodmis.org

Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-07 10:06:05 -05:00
Heiko Carstens 2f2728f6de mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that
performs proper zero and sign extension convert all 64 bit parameters
to their corresponding 32 bit compat counterparts.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2014-03-06 16:30:47 +01:00
Heiko Carstens ca2c405ab9 kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that
performs proper zero and sign extension convert all 64 bit parameters
to their corresponding 32 bit compat counterparts.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2014-03-06 16:30:46 +01:00
Heiko Carstens 62a6fa9768 kernel/compat: convert to COMPAT_SYSCALL_DEFINE
Convert all compat system call functions where all parameter types
have a size of four or less than four bytes, or are pointer types
to COMPAT_SYSCALL_DEFINE.
The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper
zero and sign extension to 64 bit of all parameters if needed.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2014-03-06 15:35:10 +01:00
Roman Pen af5040da01 blktrace: fix accounting of partially completed requests
trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:

  C   R 232 + 240 [0]
  C   R 240 + 232 [0]
  C   R 248 + 224 [0]
  C   R 256 + 216 [0]

but should be:

  C   R 232 + 8 [0]
  C   R 240 + 8 [0]
  C   R 248 + 8 [0]
  C   R 256 + 8 [0]

Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.

This patch takes into account real completion size of the request and
fixes wrong completion accounting.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-05 16:11:21 -07:00
Thomas Gleixner 8f945a3325 genirq: Move kstat_incr_irqs_this_cpu() to core
No more users outside the core code. Put it into the poison
cabinet. That also gets rid of the linux/irq.h include in
kernel_stat.h

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140223212739.124207133@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-04 17:37:54 +01:00
Thomas Gleixner 792d0018a5 genirq: Add a kstat helper to increment irq stats
There is a common pattern all over the place:

      kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));

This results in a call to core code anyway. So provide a function
which does the same thing in core.

While at it, replace the butt ugly macro with an inline.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140223212737.422068876@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-04 17:37:53 +01:00
Viresh Kumar 38edbb0b91 timer: Make sure TIMER_FLAG_MASK bits are free in allocated base
Currently we are using two lowest bit of base for internal purpose and
so they both should be zero in the allocated address. The code was
doing the right thing before this patch came in: commit c5f66e99b
(timer: Implement TIMER_IRQSAFE)

Tejun probably forgot to update this piece of code which checks if the
lowest 'n' bits are zero or not and so wasn't updated according to the
new flag. Lets use TIMER_FLAG_MASK in the calculations here.

[ tglx: Massaged changelog ]

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: tj@kernel.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/9144e10d7e854a0aa8a673332adec356d81a923c.1393576981.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-04 12:30:29 +01:00
Viresh Kumar c24a4a3694 timer: Check failure of timer_cpu_notify() before calling init_timer_stats()
timer_cpu_notify() should return NOTIFY_OK and nothing else. Anything else would
trigger a BUG_ON(). Return value of this routine is already checked correctly
but is done after issuing a call to init_timer_stats(). The right order would be
to check the error case first and then call init_timer_stats(). Lets do it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: tj@kernel.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/c439f5b6bbc2047e1662f4d523350531425bcf9d.1393576981.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-04 12:30:29 +01:00
Steven Rostedt (Red Hat) 7dec935a3a tracepoint: Do not waste memory on mods with no tracepoints
No reason to allocate tp_module structures for modules that have no
tracepoints. This just wastes memory.

Fixes: b75ef8b44b "Tracepoint: Dissociate from module mutex"
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-03 21:23:08 -05:00
Steven Rostedt (Red Hat) 45ab2813d4 tracing: Do not add event files for modules that fail tracepoints
If a module fails to add its tracepoints due to module tainting, do not
create the module event infrastructure in the debugfs directory. As the events
will not work and worse yet, they will silently fail, making the user wonder
why the events they enable do not display anything.

Having a warning on module load and the events not visible to the users
will make the cause of the problem much clearer.

Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org

Fixes: 6d723736e4 "tracing/events: add support for modules to TRACE_EVENT"
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org # 2.6.31+
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-03-03 21:11:05 -05:00
Li Zefan b8dadcb58d cpuset: use rcu_read_lock() to protect task_cs()
We no longer use task_lock() to protect tsk->cgroups.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-03 17:28:36 -05:00
Linus Torvalds 0c0bd34a14 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes, most of them SCHED_DEADLINE fallout"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Prevent rt_time growth to infinity
  sched/deadline: Switch CPU's presence test order
  sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration
  sched: Fix double normalization of vruntime
2014-03-03 10:49:24 -08:00
Heiko Carstens 03b8c7b623 futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().

This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-03 11:32:08 +01:00
Rashika Kheria 64e8d20bd3 kernel: Include appropriate header file in time/timekeeping_debug.c
Include appropriate header file kernel/time/timekeeping_internal.h in
kernel/time/timekeeping_debug.c because it has prototype declaration of
function defined in kernel/time/timekeeping_debug.c.

This eliminates the following warning in
kernel/time/timekeeping_debug.c:
kernel/time/timekeeping_debug.c:68:6: warning: no previous prototype for ‘tk_debug_account_sleep_time’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-03-02 20:52:58 -08:00
Linus Torvalds 3154da34be Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes, most of them on the tooling side"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Fix strict alias issue for find_first_bit
  perf tools: fix BFD detection on opensuse
  perf: Fix hotplug splat
  perf/x86: Fix event scheduling
  perf symbols: Destroy unused symsrcs
  perf annotate: Check availability of annotate when processing samples
2014-03-02 11:37:07 -06:00
Eric W. Biederman 6f285b19d0 audit: Send replies in the proper network namespace.
In perverse cases of file descriptor passing the current network
namespace of a process and the network namespace of a socket used by
that socket may differ.  Therefore use the network namespace of the
appropiate socket to ensure replies always go to the appropiate
socket.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-02-28 19:44:55 -08:00
Sebastian Capella 421a5fa1a6 PM / hibernate: use name_to_dev_t to parse resume
Use the name_to_dev_t call to parse the device name echo'd to
to /sys/power/resume.  This imitates the method used in hibernate.c
in software_resume, and allows the resume partition to be specified
using other equivalent device formats as well.  By allowing
/sys/debug/resume to accept the same syntax as the resume=device
parameter, we can parse the resume=device in the init script and
use the resume device directly from the kernel command line.

Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-01 01:02:52 +01:00
Rashika Kheria 6c5be29165 PM / wakeup: Include appropriate header file in kernel/power/wakelock.c
Include appropriate header file kernel/power/power.h in
kernel/power/wakelock.c because it has prototype declaration of function
defined in kernel/power/wakelock.c.

This eliminates the following warning in kernel/power/wakelock.c:
kernel/power/wakelock.c:34:9: warning: no previous prototype for ‘pm_show_wakelocks’ [-Wmissing-prototypes]
kernel/power/wakelock.c:184:5: warning: no previous prototype for ‘pm_wake_lock’ [-Wmissing-prototypes]
kernel/power/wakelock.c:232:5: warning: no previous prototype for ‘pm_wake_unlock’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-01 01:02:09 +01:00
Rashika Kheria 04b7346975 PM / sleep: Move prototype declaration to header file kernel/power/power.h
Move prototype declaration of function to header file
kernel/power/power.h because it is used by more than one file.

This eliminates the following warning in kernel/power/snapshot.c:

kernel/power/snapshot.c:1588:16: warning: no previous prototype for ‘swsusp_save’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-01 01:01:25 +01:00
Ingo Molnar d27c8438ee Merge branch 'timers/core' into sched/idle
Avoid heavy conflicts caused by WIP patches in drivers/cpuidle/cpuidle.c,
by merging these into a single base.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-28 13:58:25 +01:00
Eric W. Biederman 48095d991d audit: Use struct net not pid_t to remember the network namespce to reply in
In struct audit_netlink_list and audit_reply add a reference to the
network namespace of the caller and remove the userspace pid of the
caller.  This cleanly remembers the callers network namespace, and
removes a huge class of races and nasty failure modes that can occur
when attempting to relook up the callers network namespace from a
pid_t (including the caller's network namespace changing, pid
wraparound, and the pid simply not being present).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-02-28 04:04:33 -08:00
Thomas Gleixner bce1936951 Merge branch 'timers.2014.02.25a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into timers/core
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-28 11:53:48 +01:00
Ingo Molnar 62c206bd51 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

	* Update RCU documentation.  These were posted to LKML at
	  https://lkml.org/lkml/2014/2/17/555.

	* Miscellaneous fixes.  These were posted to LKML at
	  https://lkml.org/lkml/2014/2/17/530.  Note that two of these
	  are RCU changes to other maintainer's trees: add1f09954
	  (fs) and 8857563b81 (notifer), both of which substitute
	  rcu_access_pointer() for rcu_dereference_raw().

	* Real-time latency fixes.  These were posted to LKML at
	  https://lkml.org/lkml/2014/2/17/544.

	* Torture-test changes, including refactoring of rcutorture
	  and introduction of a vestigial locktorture.  These were posted
	  to LKML at https://lkml.org/lkml/2014/2/17/599.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-28 08:38:30 +01:00
Rashika Kheria 864f32a52b kernel: Mark function as static in kernel/seccomp.c
Mark function as static in kernel/seccomp.c because it is not used
outside this file.

This eliminates the following warning in kernel/seccomp.c:
kernel/seccomp.c:296:6: warning: no previous prototype for ?seccomp_attach_user_filter? [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2014-02-28 13:54:27 +11:00
Linus Torvalds 8d7531825c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull filesystem fixes from Jan Kara:
 "Notification, writeback, udf, quota fixes

  The notification patches are (with one exception) a fallout of my
  fsnotify rework which went into -rc1 (I've extented LTP to cover these
  cornercases to avoid similar breakage in future).

  The UDF patch is a nasty data corruption Al has recently reported,
  the revert of the writeback patch is due to possibility of violating
  sync(2) guarantees, and a quota bug can lead to corruption of quota
  files in ocfs2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Allocate overflow events with proper type
  fanotify: Handle overflow in case of permission events
  fsnotify: Fix detection whether overflow event is queued
  Revert "writeback: do not sync data dirtied after sync start"
  quota: Fix race between dqput() and dquot_scan_active()
  udf: Fix data corruption on file type conversion
  inotify: Fix reporting of cookies for inotify events
2014-02-27 10:37:22 -08:00
Li Zefan 99afb0fd5f cpuset: fix a race condition in __cpuset_node_allowed_softwall()
It's not safe to access task's cpuset after releasing task_lock().
Holding callback_mutex won't help.

Cc: <stable@vger.kernel.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-27 09:39:54 -05:00
Li Zefan 4729583006 cpuset: fix a locking issue in cpuset_migrate_mm()
I can trigger a lockdep warning:

  # mount -t cgroup -o cpuset xxx /cgroup
  # mkdir /cgroup/cpuset
  # mkdir /cgroup/tmp
  # echo 0 > /cgroup/tmp/cpuset.cpus
  # echo 0 > /cgroup/tmp/cpuset.mems
  # echo 1 > /cgroup/tmp/cpuset.memory_migrate
  # echo $$ > /cgroup/tmp/tasks
  # echo 1 > /cgruop/tmp/cpuset.mems

  ===============================
  [ INFO: suspicious RCU usage. ]
  3.14.0-rc1-0.1-default+ #32 Not tainted
  -------------------------------
  include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
  ...
    [<ffffffff81582174>] dump_stack+0x72/0x86
    [<ffffffff810b8f01>] lockdep_rcu_suspicious+0x101/0x140
    [<ffffffff81105ba1>] cpuset_migrate_mm+0xb1/0xe0
  ...

We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
we hold cpuset_mutex, which causes task_css() to complain.

This is not a false-positive but a real issue.

Holding cpuset_mutex won't prevent a task from migrating to another
cpuset, and it won't prevent the original task->cgroup from destroying
during this change.

Fixes: 5d21cc2db0 (cpuset: replace cgroup_mutex locking with cpuset internal locking)
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Li Zefan <lizefan@huawei.com>
Sigend-off-by: Tejun Heo <tj@kernel.org>
2014-02-27 09:35:59 -05:00
Rashika Kheria 64be38ab03 genirq: Include missing header file in irqdomain.c
Include appropriate header file include/linux/of_irq.h in
kernel/irq/irqdomain.c because it contains prototype definition of
function define in kernel/irq/irqdomain.c.

This eliminates the following warning in kernel/irq/irqdomain.c:
kernel/irq/irqdomain.c:468:14: warning: no previous prototype for ‘irq_create_of_mapping’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/eb89aebea7ff1a46122918ac389ebecf8248be9a.1393493276.git.rashika.kheria@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-27 13:29:35 +01:00
Peter Zijlstra 4a2345937c perf: Optimize group_sched_in()
Use the ctx pmu instead of the event pmu.

When a group leader is a software event but the group contains
hardware events, the entire group is on the hardware PMU.

Using the hardware PMU for the transaction makes most sense since
that's the most expensive one to programm (and software PMUs generally
don't have TXN support anyway).

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-sctoo9t2f3nn2c9g568928q3@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:43:26 +01:00
Mark Rutland fdded676c3 perf: Remove redundant PMU assignment
Currently perf_branch_stack_sched_in iterates over the set of pmus,
checks that each pmu has a flush_branch_stack callback, then overwrites
the pmu before calling the callback. This is either redundant or broken.

In systems with a single hw pmu, pmu == cpuctx->ctx.pmu, and thus the
assignment is redundant.

In systems with multiple hw pmus (i.e. multiple pmus with task_ctx_nr ==
perf_hw_context) the pmus share the same perf_cpu_context. Thus the
assignment can cause one of the pmus to flush its branch stack
repeatedly rather than causing each of the pmus to flush their branch
stacks. Worse still, if only some pmus have the callback the assignment
can result in a branch to NULL.

This patch removes the redundant assignment.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1392054264-23570-3-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:43:24 +01:00
Mark Rutland 9e3170411e perf: Fix prototype of find_pmu_context()
For some reason find_pmu_context() is defined as returning void * rather
than a __percpu struct perf_cpu_context *. As all the requisite types are
defined in advance there's no reason to keep it that way.

This patch modifies the prototype of pmu_find_context to return a
__percpu struct perf_cpu_context *.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1392054264-23570-2-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:43:21 +01:00
Ingo Molnar ff5a7088f0 Merge branch 'perf/urgent' into perf/core
Merge the latest fixes before queueing up new changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:41:17 +01:00
Dongsheng Yang 2b3942e4bb trace: Replace hardcoding of 19 with MAX_NICE
Use MAX_NICE instead of the value 19 for ring_buffer_benchmark.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1393251121-25534-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:41:03 +01:00
Peter Zijlstra 37e117c07b sched: Guarantee task priority in pick_next_task()
Michael spotted that the idle_balance() push down created a task
priority problem.

Previously, when we called idle_balance() before pick_next_task() it
wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
task slipped in.

Similarly for pre_schedule(), rt pre-schedule could have a dl task
slip in.

But by pulling it into the pick_next_task() loop, we'll not try a
higher task priority again.

Cure this by creating a re-start condition in pick_next_task(); and
triggering this from pick_next_task_{rt,fair}().

It also fixes a live-lock where we get stuck in pick_next_task_fair()
due to idle_balance() seeing !0 nr_running but there not actually
being any fair tasks about.

Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Fixes: 38033c37fa ("sched: Push down pre_schedule() and idle_balance()")
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:41:02 +01:00
Peter Zijlstra 06d50c65b1 sched/idle: Remove stale old file
Commit cf37b6b484 ("sched/idle: Move cpu/idle.c to sched/idle.c")
said to simply move a file; somehow it got mangled and created an old
version of the file and forgot to remove the old file.

Fix this fail; add the lost change and remove the now identical old
file.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/20140224172207.GC9987@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:41:01 +01:00
Dietmar Eggemann f5f9739d7a sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
The struct sched_avg of struct rq is only used in case group
scheduling is enabled inside __update_tg_runnable_avg() to update
per-cpu representation of a task group.  I.e. that there is no need to
maintain the runnable avg of a rq in the !CONFIG_FAIR_GROUP_SCHED case.

This patch guards struct sched_avg of struct rq and
update_rq_runnable_avg() with CONFIG_FAIR_GROUP_SCHED.

There is an extra empty definition for update_rq_runnable_avg()
necessary for the !CONFIG_FAIR_GROUP_SCHED && CONFIG_SMP case.

The function print_cfs_group_stats() which prints out struct sched_avg
of struct rq is already guarded with CONFIG_FAIR_GROUP_SCHED.

Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/530DCDC5.1060406@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:41:00 +01:00
Peter Zijlstra e3703f8cdf perf: Fix hotplug splat
Drew Richardson reported that he could make the kernel go *boom* when hotplugging
while having perf events active.

It turned out that when you have a group event, the code in
__perf_event_exit_context() fails to remove the group siblings from
the context.

We then proceed with destroying and freeing the event, and when you
re-plug the CPU and try and add another event to that CPU, things go
*boom* because you've still got dead entries there.

Reported-by: Drew Richardson <drew.richardson@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/n/tip-k6v5wundvusvcseqj1si0oz0@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:38:03 +01:00
Juri Lelli faa5993736 sched/deadline: Prevent rt_time growth to infinity
Kirill Tkhai noted:

  Since deadline tasks share rt bandwidth, we must care about
  bandwidth timer set. Otherwise rt_time may grow up to infinity
  in update_curr_dl(), if there are no other available RT tasks
  on top level bandwidth.

RT task were in fact throttled right after they got enqueued,
and never executed again (rt_time never again went below rt_runtime).

Peter then proposed to accrue DL execution on rt_time only when
rt timer is active, and proposed a patch (this patch is a slight
modification of that) to implement that behavior. While this
solves Kirill problem, it has a drawback.

Indeed, Kirill noted again:

  It looks we may get into a situation, when all CPU time is shared
  between RT and DL tasks:

  rt_runtime = n
  rt_period  = 2n

  | RT working, DL sleeping  | DL working, RT sleeping      |
  -----------------------------------------------------------
  | (1)     duration = n     | (2)     duration = n         | (repeat)
  |--------------------------|------------------------------|
  | (rt_bw timer is running) | (rt_bw timer is not running) |

  No time for fair tasks at all.

While this can happen during the first period, if rq is always backlogged,
RT tasks won't have the opportunity to execute anymore: rt_time reached
rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
throttled since rt timer didn't fire, replenishment is from now on eaten up
by DL tasks that accrue their execution on rt_time (while rt timer is
active - we have an RT task waiting for replenishment). FAIR tasks are
not touched after this first period. Ok, this is not ideal, and the situation
is even worse!

What above (the nice case), practically never happens in reality, where
your rt timer is not aligned to tasks periods, tasks are in general not
periodic, etc.. Long story short, you always risk to overload your system.

This patch is based on Peter's idea, but exploits an additional fact:
if you don't have RT tasks enqueued, it makes little sense to continue
incrementing rt_time once you reached the upper limit (DL tasks have their
own mechanism for throttling).

This cures both problems:

 - no matter how many DL instances in the past, you'll have an rt_time
   slightly above rt_runtime when an RT task is enqueued, and from that
   point on (after the first replenishment), the task will normally execute;

 - you can still eat up all bandwidth during the first period, but not
   anymore after that, remember that DL execution will increment rt_time
   till the upper limit is reached.

The situation is still not perfect! But, we have a simple solution for now,
that limits how much you can jeopardize your system, as we keep working
towards the right answer: RT groups scheduled using deadline servers.

Reported-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:29:41 +01:00
Juri Lelli eec751ed41 sched/deadline: Switch CPU's presence test order
Commit 82b9580 ("sched/deadline: Test for CPU's presence explicitly")
changed how we check if a CPU returned by cpudeadline machinery is
valid. But, we don't want to call cpu_present() if best_cpu is
equal to -1. So, switch the order of tests inside WARN_ON().

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: boris.ostrovsky@oracle.com
Cc: konrad.wilk@oracle.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:29:40 +01:00
Kirill Tkhai 3908ac13b5 sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration
In deadline class we do not have group scheduling.

So, let's remove unnecessary

	X = X;

equations.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:29:39 +01:00
George McCollister 791c9e0292 sched: Fix double normalization of vruntime
dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarentee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Signed-off-by: George McCollister <george.mccollister@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:29:38 +01:00
Chuansheng Liu c685689fd2 genirq: Remove racy waitqueue_active check
We hit one rare case below:

T1 calling disable_irq(), but hanging at synchronize_irq()
always;
The corresponding irq thread is in sleeping state;
And all CPUs are in idle state;

After analysis, we found there is one possible scenerio which
causes T1 is waiting there forever:
CPU0                                       CPU1
 synchronize_irq()
  wait_event()
    spin_lock()
                                           atomic_dec_and_test(&threads_active)
      insert the __wait into queue
    spin_unlock()
                                           if(waitqueue_active)
    atomic_read(&threads_active)
                                             wake_up()

Here after inserted the __wait into queue on CPU0, and before
test if queue is empty on CPU1, there is no barrier, it maybe
cause it is not visible for CPU1 immediately, although CPU0 has
updated the queue list.
It is similar for CPU0 atomic_read() threads_active also.

So we'd need one smp_mb() before waitqueue_active.that, but removing
the waitqueue_active() check solves it as wel l and it makes
things simple and clear.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Cc: Xiaoming Wang <xiaoming.wang@intel.com>
Link: http://lkml.kernel.org/r/1393212590-32543-1-git-send-email-chuansheng.liu@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-27 10:54:16 +01:00
Bjorn Helgaas 5edb93b89f resource: Add resource_contains()
We have two identical copies of resource_contains() already, and more
places that could use it.  This moves it to ioport.h where it can be
shared.

resource_contains(struct resource *r1, struct resource *r2) returns true
iff r1 and r2 are the same type (most callers already checked this
separately) and the r1 address range completely contains r2.

In addition, the new resource_contains() checks that both r1 and r2 have
addresses assigned to them.  If a resource is IORESOURCE_UNSET, it doesn't
have a valid address and can't contain or be contained by another resource.
Some callers already check this or for res->start.

No functional change.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-02-26 14:42:09 -07:00
Paul E. McKenney f5604f67fe Merge branch 'torture.2014.02.23a' into HEAD
torture.2014.02.23a: locktorture addition and rcutorture changes
2014-02-26 06:38:59 -08:00
Paul E. McKenney 322efba5b6 Merge branches 'doc.2014.02.24a', 'fixes.2014.02.26a' and 'rt.2014.02.17b' into HEAD
doc.2014.02.24a: Documentation changes
fixes.2014.02.26a: Miscellaneous fixes
rt.2014.02.17b: Response-time-related changes
2014-02-26 06:36:09 -08:00
Paul Gortmaker 5cb5c6e18f rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
The kbuild test bot uncovered an implicit dependence on the
trace header being present before rcu.h in ia64 allmodconfig
that looks like this:

In file included from kernel/ksysfs.c:22:0:
kernel/rcu/rcu.h: In function '__rcu_reclaim':
kernel/rcu/rcu.h:107:3: error: implicit declaration of function 'trace_rcu_invoke_kfree_callback' [-Werror=implicit-function-declaration]
kernel/rcu/rcu.h:112:3: error: implicit declaration of function 'trace_rcu_invoke_callback' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors

Looking at other rcu.h users, we can find that they all
were sourcing the trace header in advance of rcu.h itself,
as seen in the context of this diff.  There were also some
inconsistencies as to whether it was or wasn't sourced based
on the parent tracing Kconfig.

Rather than "fix" it at each use site, and have inconsistent
use based on whether "#ifdef CONFIG_RCU_TRACE" was used or not,
lets just source the trace header just once, in the actual consumer
of it, which is rcu.h itself.  We include it unconditionally, as
build testing shows us that is a hard requirement for some files.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-26 06:35:18 -08:00
Paul Gortmaker 7a75474318 rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
This commit fixes the follwoing warning:

kernel/ksysfs.c:143:5: warning: symbol 'rcu_expedited' was not declared. Should it be static?

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
[ paulmck: Moved the declaration to include/linux/rcupdate.h to avoid
	   including the RCU-internal rcu.h file outside of RCU. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-26 06:35:16 -08:00
Paul E. McKenney 8857563b81 notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
(Trivial patch.)

If the code is looking at the RCU-protected pointer itself, but not
dereferencing it, the rcu_dereference() functions can be downgraded
to rcu_access_pointer().  This commit makes this downgrade in
__blocking_notifier_call_chain() which simply compares the RCU-protected
pointer against NULL with no dereferencing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-26 06:35:13 -08:00
Vijaya Kumar K d498d4b47f KGDB: make kgdb_breakpoint() as noinline
The function kgdb_breakpoint() sets up break point at
compile time by calling arch_kgdb_breakpoint();
Though this call is surrounded by wmb() barrier,
the compile can still re-order the break point,
because this scheduling barrier is not a code motion
barrier in gcc.

Making kgdb_breakpoint() as noinline solves this problem
of code reording around break point instruction and also
avoids problem of being called as inline function from
other places

More details about discussion on this can be found here
http://comments.gmane.org/gmane.linux.ports.arm.kernel/269732

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-02-26 11:16:25 +00:00
Oleg Nesterov aea369b959 timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
The internal_add_timer() function updates base->next_timer only if
timer->expires < base->next_timer. This is correct, but it also makes
sense to do the same if we add the first non-deferrable timer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
2014-02-25 12:39:01 -08:00
Paul E. McKenney 18d8cb64c9 timers: Reduce future __run_timers() latency for first add to empty list
The __run_timers() function currently steps through the list one jiffy at
a time in order to update the timer wheel.  However, if the timer wheel
is empty, no adjustment is needed other than updating ->timer_jiffies.
Therefore, just before we add a timer to an empty timer wheel, we should
mark the timer wheel as being up to date.  This marking will reduce (and
perhaps eliminate) the jiffy-stepping that a future __run_timers() call
will need to do in response to some future timer posting or migration.
This commit therefore updates ->timer_jiffies for this case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
2014-02-25 12:39:00 -08:00
Paul E. McKenney 16d937f880 timers: Reduce future __run_timers() latency for newly emptied list
The __run_timers() function currently steps through the list one jiffy at
a time in order to update the timer wheel.  However, if the timer wheel
is empty, no adjustment is needed other than updating ->timer_jiffies.
Therefore, if we just emptied the timer wheel, for example, by deleting
the last timer, we should mark the timer wheel as being up to date.
This marking will reduce (and perhaps eliminate) the jiffy-stepping that
a future __run_timers() call will need to do in response to some future
timer posting or migration.  This commit therefore catches ->timer_jiffies
for this case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
2014-02-25 12:39:00 -08:00
Paul E. McKenney d550e81dc0 timers: Reduce __run_timers() latency for empty list
The __run_timers() function currently steps through the list one jiffy at
a time in order to update the timer wheel.  However, if the timer wheel
is empty, no adjustment is needed other than updating ->timer_jiffies.
In this case, which is likely to be common for NO_HZ_FULL kernels, the
kernel currently incurs a large latency for no good reason.  This commit
therefore short-circuits this case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
2014-02-25 12:39:00 -08:00
Paul E. McKenney fff421580f timers: Track total number of timers in list
Currently, the tvec_base structure's ->active_timers field tracks only
the non-deferrable timers, which means that even if ->active_timers is
zero, there might well be deferrable timers in the list.  This commit
therefore adds an ->all_timers field to track all the timers, whether
deferrable or not.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
2014-02-25 12:38:59 -08:00
Tejun Heo a60bed296a cgroup_freezer: document freezer_fork() subtleties
cgroup_subsys->fork() callback is special in that it's called outside
the usual cgroup locking and may race with on-going migration.
freezer_fork() currently doesn't consider such race condition;
however, it is still correct thanks to the fact that freeze_task() may
be called spuriously.

This is quite subtle.  Let's explain what's going on and add test to
detect racing and losing to task migration and skip freeze_task() in
such cases for documentation.

This doesn't make any behavior difference meaningful to userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
2014-02-25 10:04:40 -05:00
Tejun Heo 952aaa1254 cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup_transfer_tasks() can currently fail in the middle due to memory
allocation failure.  When that happens, the function just aborts and
returns error code and there's no way to tell how many actually got
migrated at the point of failure and or to revert the partial
migration.

Update it to use cgroup_migrate{_add_src|prepare_dst|migrate|finish}()
so that the function either succeeds or fails as a whole as long as
->can_attach() doesn't fail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:03 -05:00
Tejun Heo 0e1d768f1b cgroup: drop task_lock() protection around task->cgroups
For optimization, task_lock() is additionally used to protect
task->cgroups.  The optimization is pretty dubious as either
css_set_rwsem is grabbed anyway or PF_EXITING already protects
task->cgroups.  It adds only overhead and confusion at this point.
Let's drop task_[un]lock() and update comments accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:03 -05:00
Tejun Heo eaf797abc5 cgroup: update how a newly forked task gets associated with css_set
When a new process is forked, cgroup_fork() associates it with the
css_set of its parent but doesn't link it into it.  After the new
process is linked to tasklist, cgroup_post_fork() does the linking.

This is problematic for cgroup_transfer_tasks() as there's no way to
tell whether there are tasks which are pointing to a css_set but not
linked yet.  It is impossible to implement an operation which transfer
all tasks of a cgroup to another and the current
cgroup_transfer_tasks() can easily be tricked into leaving a newly
forked process behind if it gets called between cgroup_fork() and
cgroup_post_fork().

Let's make association with a css_set and linking atomic by moving it
to cgroup_post_fork().  cgroup_fork() sets child->cgroups to
init_css_set as a placeholder and cgroup_post_fork() is updated to
perform both the association with the parent's cgroup and linking
there.  This means that a newly created task will point to
init_css_set without holding a ref to it much like what it does on the
exit path.  Empty cg_list is used to indicate that the task isn't
holding a ref to the associated css_set.

This fixes an actual bug with cgroup_transfer_tasks(); however, I'm
not marking it for -stable.  The whole thing is broken in multiple
other ways which require invasive updates to fix and I don't think
it's worthwhile to bother with backporting this particular one.
Fortunately, the only user is cpuset and these bugs don't crash the
machine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:03 -05:00
Tejun Heo 1958d2d53d cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.

This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.

This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.

The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.

Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:03 -05:00
Tejun Heo ceb6a081f6 cgroup: separate out cset_group_from_root() from task_cgroup_from_root()
This will be used by the planned migration path update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:02 -05:00
Tejun Heo b3dc094e93 cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.

* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.

* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.

  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.

To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.

Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.

To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.

This resolves the above listed issues with moderate additional
complexity.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:01 -05:00
Tejun Heo c75611282c cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.

* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.

* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.

  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.

To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.

In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 10:04:01 -05:00
Tejun Heo f153ad11bc Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
Pull in for-3.14-fixes to receive 532de3fc72 ("cgroup: update
cgroup_enable_task_cg_lists() to grab siglock") which conflicts with
afeb0f9fd4 ("cgroup: relocate cgroup_enable_task_cg_lists()") and
the following cg_lists updates.  This is likely to cause further
conflicts down the line too, so let's merge it early.

As cgroup_enable_task_cg_lists() is relocated in for-3.15, this merge
causes conflict in the original position.  It's resolved by applying
siglock changes to the updated version in the new location.

Conflicts:
	kernel/cgroup.c

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-25 09:56:49 -05:00
Frederic Weisbecker c46fff2a3b smp: Rename __smp_call_function_single() to smp_call_function_single_async()
The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().

The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. This is actually a requirement that result from the
purpose of this function which is to raise an IPI asynchronously.

As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller who then needs to serialize the
IPIs on this csd.

Lets rename the function and enhance the comments so that they reflect
these properties.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:47:15 -08:00
Frederic Weisbecker fce8ad1568 smp: Remove wait argument from __smp_call_function_single()
The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.

This flexibility comes at the expense of the caller who then needs to
synchronize the csd lifecycle by himself and make sure that IPIs on a
single csd are serialized.

This is how __smp_call_function_single() works when wait = 0 and this
usecase is relevant.

Now there don't seem to be any usecase with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Lets look
at the two possible scenario:

1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
   in an object. It looks like a nice and convenient pattern at the first
   sight because we can then retrieve the object from the IPI handler easily.

   But actually it is a waste of memory space in the object since the csd
   can be allocated from the stack by smp_call_function_single(wait = 1)
   and the object can be passed an the IPI argument.

   Besides that, embedding the csd in an object is more error prone
   because the caller must take care of the serialization of the IPIs
   for this csd.

2) The user calls __smp_call_function_single(wait = 1) on a csd that
   is allocated on the stack. It's ok but smp_call_function_single()
   can do it as well and it already takes care of the allocation on the
   stack. Again it's more simple and less error prone.

Therefore, using the underscore prepend API version with wait = 1
is a bad pattern and a sign that the caller can do safer and more
simple.

There was a single user of that which has just been converted.
So lets remove this option to discourage further users.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:47:09 -08:00
Frederic Weisbecker e0a23b0628 watchdog: Simplify a little the IPI call
In order to remotely restart the watchdog hrtimer, update_timers()
allocates a csd on the stack and pass it to __smp_call_function_single().

There is no partcular need, however, for a specific csd here. Lets
simplify that a little by calling smp_call_function_single()
which can already take care of the csd allocation by itself.

Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:47:05 -08:00
Frederic Weisbecker d7877c03f1 smp: Move __smp_call_function_single() below its safe version
Move this function closer to __smp_call_function_single(). These functions
have very similar behavior and should be displayed in the same block
for clarity.

Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:47:01 -08:00
Frederic Weisbecker 8b28499a71 smp: Consolidate the various smp_call_function_single() declensions
__smp_call_function_single() and smp_call_function_single() share some
code that can be factorized: execute inline when the target is local,
check if the target is online, lock the csd, call generic_exec_single().

Lets move the common parts to generic_exec_single().

Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:46:58 -08:00
Jan Kara 08eed44c72 smp: Teach __smp_call_function_single() to check for offline cpus
Align __smp_call_function_single() with smp_call_function_single() so
that it also checks whether requested cpu is still online.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:46:55 -08:00
Jan Kara 5fd77595ec smp: Iterate functions through llist_for_each_entry_safe()
The IPI function llist iteration is open coded. Lets simplify this
with using an llist iterator.

Also we want to keep the iteration safe against possible
csd.llist->next value reuse from the IPI handler. At least the block
subsystem used to do such things so lets stay careful and use
llist_for_each_entry_safe().

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24 14:46:47 -08:00
Joe Perches f5645d3575 capability: Use current logging styles
Prefix logging output with "capability: " via pr_fmt.
Convert printks to pr_<level>.
Use pr_<level>_once instead of guard flags.
Coalesce formats.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2014-02-24 14:44:53 +11:00
Linus Torvalds b2880eb83d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "Serialize the registration of a new sched_clock in the currently ARM
  only generic sched_clock facilty to avoid sched_clock havoc"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched_clock: Prevent callers from seeing half-updated data
2014-02-23 14:17:08 -08:00
Paul E. McKenney e086481baf rcutorture: Add a lock_busted to test the test
This commit adds a maximally broken locking primitive in which
lock acquisition and release are both no-ops.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:04:43 -08:00
Paul E. McKenney f881825a73 rcutorture: Gracefully handle NULL cleanup hooks
Although most torture tests will have some cleanup hook, it is possible
that one might not.  This commit therefore enables graceful handling of
a NULL cleanup hook during torture-test shutdown.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:04:39 -08:00
Paul E. McKenney ff20e251c4 rcutorture: Add an rcu_busted to test the test
This commit adds a deliberately buggy RCU implementation into rcutorture
to allow easy checking that rcutorture correctly flags buggy RCU
implementations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:04:30 -08:00
Paul E. McKenney 0af3fe1efa locktorture: Add a lock-torture kernel module
This commit adds the locking counterpart to rcutorture.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Make n_lock_torture_errors and torture_spinlock static
  as suggested by Fengguang Wu. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:04:29 -08:00
Paul E. McKenney bfefc73aa1 rcutorture: Stop generic kthreads in torture_cleanup()
The specific torture modules (like rcutorture) need to call
torture_cleanup() in any case, so this commit makes torture_cleanup()
deal with torture_shutdown_cleanup() and torture_stutter_cleanup() so
that the specific modules don't have to deal with these details.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:04:27 -08:00
Paul E. McKenney 9c029b8609 rcutorture: Abstract torture_stop_kthread()
Stopping of kthreads is not RCU-specific, so this commit abstracts
out torture_stop_kthread(), saving a few lines of code in the process.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:04:25 -08:00
Paul E. McKenney 47cf29b9e7 rcutorture: Abstract torture_create_kthread()
Creation of kthreads is not RCU-specific, so this commit abstracts
out torture_create_kthread(), saving a few tens of lines of code in
the process.

This change requires modifying VERBOSE_TOROUT_ERRSTRING() to take a
non-const string, so that _torture_create_kthread() can avoid an
open-coded substitute.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:03:24 -08:00
Paul E. McKenney bc8f83e2c0 rcutorture: Fix missing-return bug in rcu_torture_barrier_init()
This commit adds a missing error return to the code path that creates
the rcu_torture_barrier() kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:03:22 -08:00
Paul E. McKenney 7fafaac5b9 rcutorture: Fix rcutorture shutdown races
Not all of the rcutorture kthreads waited for kthread_should_stop()
before returning from their top-level functions, and none of them
used torture_shutdown_absorb() properly.  These problems can result in
segfaults and hangs at shutdown time, and some recent changes perturbed
timing sufficiently to make them much more probable.  This commit
therefore creates a torture_kthread_stopping() function that does the
proper kthread shutdown dance in one centralized location.

Accommodate this grouping by making VERBOSE_TOROUT_STRING() capable of
taking a non-const string as its argument, which allows the new
torture_kthread_stopping() to pass its "title" argument directly to
the updated version of VERBOSE_TOROUT_STRING().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-23 09:03:21 -08:00
Paul E. McKenney 14562d1cf1 rcutorture: Announce task creation
A few "stealth-start rcutorture kthreads" have accumulated over the years,
so this commit adds console-log announcements (but only if the torture
tests are running verbose).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:03:20 -08:00
Paul E. McKenney 01025ebc99 rcutorture: Clean up rcu_torture_init() error checking
This commit applies some simple cleanups to rcu_torture_init() error
checking.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:03:19 -08:00
Paul E. McKenney e991dbc077 rcutorture: Abstract torture_shutdown()
Because auto-shutdown of torture testing is not specific to RCU,
this commit moves the auto-shutdown function to kernel/torture.c.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:03:18 -08:00
Paul E. McKenney 57a2fe90fc rcutorture: Apply ACCESS_ONCE() to racy fullstop accesses
Because the fullstop variable can be accessed while it is being updated,
this commit avoids any resulting compiler mischief through use of
ACCESS_ONCE() for non-initialization accesses to this shared variable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:03:16 -08:00
Paul E. McKenney 628edaa506 rcutorture: Abstract stutter_wait()
Because stuttering the test load (stopping and restarting it) is useful
for non-RCU testing, this commit moves the load-stuttering functionality
to kernel/torture.c.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:02:54 -08:00
Paul E. McKenney fac480efcb rcutorture: Add diagnostic for unscheduled system shutdown
Currently, rcutorture can terminate via rmmod, via self-shutdown,
via something else shutting the system down, or of course the usual
catastrophic termination.  The first two get flagged, so this commit adds
a message for the third.  For the fourth, your warranty is void as always.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:13 -08:00
Paul E. McKenney 36970bb91d rcutorture: Privatize fullstop
This commit introduces the torture_must_stop() function in order to
keep use of the fullstop variable local to kernel/torture.c.  There
is also a torture_must_stop_irq() counterpart for use from RCU callbacks,
timeout handlers, and the like.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:12 -08:00
Paul E. McKenney 4622b487ec rcutorture: Abstract torture_shutdown_notify()
Because handling the race between rmmod and system shutdown is not
specific to RCU, this commit abstracts torture_shutdown_notify(),
placing this code into kernel/torture.c.  This change also allows
fullstop_mutex to be private to kernel/torture.c.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:11 -08:00
Paul E. McKenney cc47ae0830 rcutorture: Abstract torture-test cleanup
This commit creates a torture_cleanup() that handles the generic
cleanup actions local to kernel/torture.c.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:08 -08:00
Paul E. McKenney b5daa8f3b3 rcutorture: Abstract torture-test initialization
This commit creates torture_init_begin() and torture_init_end() functions
to abstract locking and allow the torture_type and verbose variables
in kernel/torture.o to become static.  With a bit more abstraction,
fullstop_mutex will also become static.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:07 -08:00
Paul E. McKenney 2e9e8081d2 rcutorture: Abstract torture_onoff()
Because online/offline torturing is not specific to RCU, this commit
abstracts it into the kernel/torture.c module to allow other torture
tests to use it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:06 -08:00
Paul E. McKenney 3808dc9fab rcutorture: Abstract torture_shuffle()
The torture_shuffle() function forces each CPU in turn to go idle
periodically in order to check for problems interacting with per-CPU
variables and with dyntick-idle mode.  Because this sort of debugging
is not specific to RCU, this commit abstracts that functionality.
This in turn requires abstracting some additional infrastructure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:05 -08:00
Paul E. McKenney f67a33561e rcutorture: Abstract torture_shutdown_absorb()
Because handling races between rmmod and normal shutdown is not specific
to rcutorture, this commit renames rcutorture_shutdown_absorb() to
torture_shutdown_absorb() and pulls it out into then kernel/torture.c
module.  This implies pulling the fullstop mechanism into kernel/torture.c
as well.

The exporting of fullstop and fullstop_mutex is ugly and must die.
And it does in fact die in later commits that introduce higher-level
APIs that encapsulate both of these variables.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>`
2014-02-23 09:01:04 -08:00
Paul E. McKenney c2884de38e rcutorture: Abstract TOROUT_STRING() and friends
These diagnostic macros are not confined to torturing RCU, so this commit
makes them available to other torture tests.  Also removed the do-while
from TOROUT_STRING() in response to checkpatch complaints.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-23 09:01:02 -08:00
Paul E. McKenney 5ccf60f23d rcutorture: Rename PRINTK to TOROUT
Since it doesn't do printk()s anymore anyway, this commit renames these
macros from PRINTK to TOROUT (short for torture output).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-23 09:01:01 -08:00
Paul E. McKenney 9e25022541 rcutorture: Abstract torture_param()
Create a torture_param() macro and apply it to rcutorture in order to
save a few lines of code.  This same macro may be applied to other
torture frameworks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-23 09:01:00 -08:00
Paul E. McKenney 51b1130eb5 rcutorture: Abstract rcu_torture_random()
Because rcu_torture_random() will be used by the locking equivalent to
rcutorture, pull it out into its own module.  This new module cannot
be separately configured, instead, use the Kconfig "select" statement
from the Kconfig options of tests depending on it.

Suggested-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-23 09:00:58 -08:00
Paul E. McKenney 806274c018 rcutorture: Fix checkpatch complaint
This commit does a code-style cleanup so that the first curly brace
of an initializer does not appear at the beginning of a line.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-02-23 09:00:57 -08:00
Mike Galbraith d987fc7f32 sched, nohz: Exclude isolated cores from load balancing
The user explicitly disabled load balancing, else this core would not be
disconnected.  Don't add these to nohz.idle_cpus_mask.

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Lei Wen <leiwen@marvell.com>
Link: http://lkml.kernel.org/n/tip-vmme4f49psirp966pklm5l9j@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:17:22 +01:00
Morten Rasmussen de91b9cb97 sched: Fix select_task_rq_fair() description comments
Brings select_task_rq_fair() description comments up-to-date.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392732864-10927-1-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:17:04 +01:00
Dongsheng Yang 144818422b workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/6d85138180c00ce86975addab6e34b24b84f00a5.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:16:41 +01:00
Dongsheng Yang c4a4d2f431 sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/0261f094b836f1acbcdf52e7166487c0c77323c8.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:16:19 +01:00
Dongsheng Yang 75e45d512f sched: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/bd80780f19b4f9b4a765acc353c8dbc130274dd6.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:15:54 +01:00
Dongsheng Yang d277d868da rcu: Use MAX_NICE to replace hardcoding of 19
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/5b3bf232f41b33ab703a1595e94671b303e2d1fc.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:15:24 +01:00
Li Zefan 11c785b79e sched/rt: Make init_sched_rt_calss() __init
It's a bootstrap function.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/52F5CC09.1080502@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:11:10 +01:00
Li Zefan d82fd25356 sched/rt: Remove 'leaf_rt_rq_list' from 'struct rq'
This is a leftover from commit e23ee74777
("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list").

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/52F5CBF6.4060901@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:10:43 +01:00
Thomas Gleixner c365c292d0 sched: Consider pi boosting in setscheduler()
If a PI boosted task policy/priority is modified by a setscheduler()
call we unconditionally dequeue and requeue the task if it is on the
runqueue even if the new priority is lower than the current effective
boosted priority. This can result in undesired reordering of the
priority bucket list.

If the new priority is less or equal than the current effective we
just store the new parameters in the task struct and leave the
scheduler class and the runqueue untouched. This is handled when the
task deboosts itself. Only if the new priority is higher than the
effective boosted priority we apply the change immediately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:10:04 +01:00
Thomas Gleixner 81a44c5441 sched: Queue RT tasks to head when prio drops
The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:09:41 +01:00
Thomas Gleixner d6b1e91197 sched: Adjust p->sched_reset_on_fork when nothing else changes
If the policy and priority remain unchanged a possible modification of
p->sched_reset_on_fork gets lost in the early exit path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:08:43 +01:00
Thomas Gleixner 8f47b1871b sched: Add better debug output for might_sleep()
might_sleep() can tell us where interrupts have been disabled, but we
have no idea what disabled preemption. Add some debug infrastructure.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:08:08 +01:00
Thomas Gleixner db273be2a7 sched: Check for idle task in might_sleep()
Idle is not allowed to call sleeping functions ever!

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:07:50 +01:00
Thomas Gleixner 77177856e3 sched: Init idle->on_rq in init_idle()
We stumbled in RT over a SMP bringup issue on ARM where the
idle->on_rq == 0 was causing try_to_wakeup() on the other cpu to run
into nada land.

After adding that idle->on_rq = 1; I was able to find the root cause
of the lockup: the idle task on the newly woken up cpu was fiddling
with a sleeping spinlock, which is a nono.

I kept the init of idle->on_rq to keep the state consistent and to
avoid another long lasting debug session.

As a side note, the whole debug mess could have been avoided if
might_sleep() would have yelled when called from the idle task. That's
fixed with patch 2/6 - and that one actually has a changelog :)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22 18:07:36 +01:00
Peter Zijlstra cd578abb24 perf/x86: Warn to early_printk() in case irq_work is too slow
On Mon, Feb 10, 2014 at 08:45:16AM -0800, Dave Hansen wrote:
> The reason I coded this up was that NMIs were firing off so fast that
> nothing else was getting a chance to run.  With this patch, at least the
> printk() would come out and I'd have some idea what was going on.

It will start spewing to early_printk() (which is a lot nicer to use
from NMI context too) when it fails to queue the IRQ-work because its
already enqueued.

It does have the false-positive for when two CPUs trigger the warn
concurrently, but that should be rare and some extra clutter on the
early printk shouldn't be a problem.

Cc: hpa@zytor.com
Cc: tglx@linutronix.de
Cc: dzickus@redhat.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: mingo@kernel.org
Fixes: 6a02ad66b2 ("perf/x86: Push the duration-logging printk() to IRQ context")
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140211150116.GO27965@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21 21:49:07 +01:00
Peter Zijlstra dc87734106 sched: Remove some #ifdeffery
Remove a few gratuitous #ifdefs in pick_next_task*().

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21 21:43:18 +01:00