Commit Graph

376 Commits

Author SHA1 Message Date
Zhaolei 57f1f0874f time: add function to convert between calendar time and broken-down time for universal use
There are many similar code in kernel for one object: convert time between
calendar time and broken-down time.

Here is some source I found:
  fs/ncpfs/dir.c
  fs/smbfs/proc.c
  fs/fat/misc.c
  fs/udf/udftime.c
  fs/cifs/netmisc.c
  net/netfilter/xt_time.c
  drivers/scsi/ips.c
  drivers/input/misc/hp_sdc_rtc.c
  drivers/rtc/rtc-lib.c
  arch/ia64/hp/sim/boot/fw-emu.c
  arch/m68k/mac/misc.c
  arch/powerpc/kernel/time.c
  arch/parisc/include/asm/rtc.h
  ...

We can make a common function for this type of conversion, At least we
can get following benefit:

1: Make kernel simple and unify
2: Easy to fix bug in converting code
3: Reduce clone of code in future
   For example, I'm trying to make ftrace display walltime,
   this patch will make me easy.

This code is based on code from glibc-2.6

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:56 -07:00
Linus Torvalds a03fdb7612 Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (34 commits)
  time: Prevent 32 bit overflow with set_normalized_timespec()
  clocksource: Delay clocksource down rating to late boot
  clocksource: clocksource_select must be called with mutex locked
  clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash
  timers: Drop a function prototype
  clocksource: Resolve cpu hotplug dead lock with TSC unstable
  timer.c: Fix S/390 comments
  timekeeping: Fix invalid getboottime() value
  timekeeping: Fix up read_persistent_clock() breakage on sh
  timekeeping: Increase granularity of read_persistent_clock(), build fix
  time: Introduce CLOCK_REALTIME_COARSE
  x86: Do not unregister PIT clocksource on PIT oneshot setup/shutdown
  clocksource: Avoid clocksource watchdog circular locking dependency
  clocksource: Protect the watchdog rating changes with clocksource_mutex
  clocksource: Call clocksource_change_rating() outside of watchdog_lock
  timekeeping: Introduce read_boot_clock
  timekeeping: Increase granularity of read_persistent_clock()
  timekeeping: Update clocksource with stop_machine
  timekeeping: Add timekeeper read_clock helper functions
  timekeeping: Move NTP adjusted clock multiplier to struct timekeeper
  ...

Fix trivial conflict due to MIPS lemote -> loongson renaming.
2009-09-18 09:15:24 -07:00
Thomas Gleixner 54a6bc0b07 clocksource: Delay clocksource down rating to late boot
The down rating of clock sources in the early boot process via the
clock source watchdog mechanism can happen way before the per cpu
event queues are initialized. This leads to a boot crash on x86 when
the TSC is marked unstable in the SMP bring up.

The selection of a clock source for time keeping happens in the late
boot process so we can safely delay the list manipulation until
clocksource_done_booting() is called.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-09-14 21:59:32 +02:00
Thomas Gleixner e6c733050f clocksource: clocksource_select must be called with mutex locked
The callers of clocksource_select must hold clocksource_mutex to
protect the clocksource_list.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-09-14 21:59:32 +02:00
Martin Schwidefsky f79e0258ea clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash
The watchdog timer is started after the watchdog clocksource
and at least one watched clocksource have been registered. The
clocksource work element watchdog_work is initialized just
before the clocksource timer is started. This is too late for
the clocksource_mark_unstable call from native_cpu_up. To fix
this use a static initializer for watchdog_work.

This resolves a boot crash reported by multiple people.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090911153305.3fe9a361@skybase>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-11 20:17:18 +02:00
Thomas Gleixner 7285dd7fd3 clocksource: Resolve cpu hotplug dead lock with TSC unstable
Martin Schwidefsky analyzed it:
To register a clocksource the clocksource_mutex is acquired and if
necessary timekeeping_notify is called to install the clocksource as
the timekeeper clock. timekeeping_notify uses stop_machine which needs
to take cpu_add_remove_lock mutex.
Starting a new cpu is done with the cpu_add_remove_lock mutex held.
native_cpu_up checks the tsc of the new cpu and if the tsc is no good
clocksource_change_rating is called. Which needs the clocksource_mutex
and the deadlock is complete.

The solution is to replace the TSC via the clocksource watchdog
mechanism. Mark the TSC as unstable and schedule the watchdog work so
it gets removed in the watchdog thread context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
2009-08-28 20:25:24 +02:00
Hiroshi Shimamoto 36d47481b3 timekeeping: Fix invalid getboottime() value
Don't use timespec_add_safe() with wall_to_monotonic, because
wall_to_monotonic has negative values which will cause overflow
in timespec_add_safe(). That makes btime in /proc/stat invalid.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <4A937FDE.4050506@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-25 09:09:02 +02:00
john stultz da15cfdae0 time: Introduce CLOCK_REALTIME_COARSE
After talking with some application writers who want very fast, but not
fine-grained timestamps, I decided to try to implement new clock_ids
to clock_gettime(): CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE
which returns the time at the last tick. This is very fast as we don't
have to access any hardware (which can be very painful if you're using
something like the acpi_pm clocksource), and we can even use the vdso
clock_gettime() method to avoid the syscall. The only trade off is you
only get low-res tick grained time resolution.

This isn't a new idea, I know Ingo has a patch in the -rt tree that made
the vsyscall gettimeofday() return coarse grained time when the
vsyscall64 sysctrl was set to 2. However this affects all applications
on a system.

With this method, applications can choose the proper speed/granularity
trade-off for themselves.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: nikolag@ca.ibm.com
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: arjan@infradead.org
Cc: jonathan@jonmasters.org
LKML-Reference: <1250734414.6897.5.camel@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-21 21:43:46 +02:00
Suresh Siddha f833bab87f clockevent: Prevent dead lock on clockevents_lock
Currently clockevents_notify() is called with interrupts enabled at
some places and interrupts disabled at some other places.

This results in a deadlock in this scenario.

cpu A holds clockevents_lock in clockevents_notify() with irqs enabled
cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled
cpu C doing set_mtrr() which will try to rendezvous of all the cpus.

This will result in C and A come to the rendezvous point and waiting
for B. B is stuck forever waiting for the spinlock and thus not
reaching the rendezvous point.

Fix the clockevents code so that clockevents_lock is taken with
interrupts disabled and thus avoid the above deadlock.

Also call lapic_timer_propagate_broadcast() on the destination cpu so
that we avoid calling smp_call_function() in the clockevents notifier
chain.

This issue left us wondering if we need to change the MTRR rendezvous
logic to use stop machine logic (instead of smp_call_function) or add
a check in spinlock debug code to see if there are other spinlocks
which gets taken under both interrupts enabled/disabled conditions.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: "Brown Len" <len.brown@intel.com>
LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-19 18:15:10 +02:00
Martin Schwidefsky 01548f4d3e clocksource: Avoid clocksource watchdog circular locking dependency
stop_machine from a multithreaded workqueue is not allowed because
of a circular locking dependency between cpu_down and the workqueue
execution. Use a kernel thread to do the clocksource downgrade.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090818170942.3ab80c91@skybase>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-19 12:00:56 +02:00
Thomas Gleixner d0981a1b21 clocksource: Protect the watchdog rating changes with clocksource_mutex
Martin pointed out that commit 6ea41d2529 (clocksource: Call
clocksource_change_rating() outside of watchdog_lock) has a
theoretical reference count problem. The calls to
clocksource_change_rating() are now done outside of the clocksource
mutex and outside of the watchdog lock. A concurrent
clocksource_unregister() could remove the clock.

Split out the code which changes the rating from
clocksource_change_rating() into __clocksource_change_rating().

Protect the clocksource_watchdog_work() code sequence with the
clocksource_mutex() and call __clocksource_change_rating().

LKML-Reference: <alpine.LFD.2.00.0908171038420.2782@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-08-19 11:42:48 +02:00
Amerigo Wang de809347ae timers: Drop write permission on /proc/timer_list
/proc/timer_list and /proc/slabinfo are not supposed to be
written, so there should be no write permissions on it.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <20090817094525.6355.88682.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-17 11:47:31 +02:00
Thomas Gleixner 6ea41d252f clocksource: Call clocksource_change_rating() outside of watchdog_lock
The changes to the watchdog logic introduced a lock inversion between
watchdog_lock and clocksource_mutex. Change the rating outside of
watchdog_lock to avoid it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 13:20:42 +02:00
Martin Schwidefsky 23970e389e timekeeping: Introduce read_boot_clock
Add the new function read_boot_clock to get the exact time the system
has been started. For architectures without support for exact boot
time a new weak function is added that returns 0.  Use the exact boot
time to initialize wall_to_monotonic, or xtime if the read_boot_clock
returned 0.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134811.296703241@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:47 +02:00
Martin Schwidefsky d4f587c67f timekeeping: Increase granularity of read_persistent_clock()
The persistent clock of some architectures (e.g. s390) have a
better granularity than seconds. To reduce the delta between the
host clock and the guest clock in a virtualized system change the 
read_persistent_clock function to return a struct timespec.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134811.013873340@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 75c5158f70 timekeeping: Update clocksource with stop_machine
update_wall_time calls change_clocksource HZ times per second to check
if a new clock source is available. In close to 100% of all calls
there is no new clock. Replace the tick based check by an update done
with stop_machine.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134810.711836357@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 2ba2a3054f timekeeping: Add timekeeper read_clock helper functions
Add timekeeper_read_clock_ntp and timekeeper_read_clock_raw and use
them for getnstimeofday, ktime_get, ktime_get_ts and getrawmonotonic.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134810.435105711@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 0a54419836 timekeeping: Move NTP adjusted clock multiplier to struct timekeeper
The clocksource structure has two multipliers, the unmodified multiplier
clock->mult_orig and the NTP corrected multiplier clock->mult. The NTP
multiplier is misplaced in the struct clocksource, this is private
information of the timekeeping code. Add the mult field to the struct
timekeeper to contain the NTP corrected value, keep the unmodifed
multiplier in clock->mult and remove clock->mult_orig.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134810.149047645@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 23ce72117c timekeeping: Add xtime_shift and ntp_error_shift to struct timekeeper
The xtime_nsec value in the timekeeper structure is shifted by a few
bits to improve precision. This happens to be the same value as the
clock->shift. To improve readability add xtime_shift to the timekeeper
and use it instead of the clock->shift. Likewise add ntp_error_shift
and replace all (NTP_SCALE_SHIFT - clock->shift) expressions.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134809.871899606@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 155ec60226 timekeeping: Introduce struct timekeeper
Add struct timekeeper to keep the internal values timekeeping.c needs
in regard to the currently selected clock source. This moves the
timekeeping intervals, xtime_nsec and the ntp error value from struct
clocksource to struct timekeeper. The raw_time is removed from the
clocksource as well. It gets treated like xtime as a global variable.
Eventually xtime raw_time should be moved to struct timekeeper.

[ tglx: minor cleanup ]

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134809.613209842@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky c55c87c892 clocksource: Move watchdog downgrade to a work queue thread
Move the downgrade of an unstable clocksource from the timer interrupt
context into the process context of a work queue thread. This is
needed to be able to do the clocksource switch with stop_machine.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134809.354926067@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky fb63a0ebe6 clocksource: Refactor clocksource watchdog
Refactor clocksource watchdog code to make it more readable. Add
clocksource_dequeue_watchdog to remove a clocksource from the watchdog
list when it is unregistered.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134809.110881699@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 0f8e8ef7c2 clocksource: Simplify clocksource watchdog resume logic
To resume the clocksource watchdog just remove the CLOCK_SOURCE_WATCHDOG
bit from the watched clocksource.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134808.880925790@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 8cf4e750f8 clocksource: Delay clocksource watchdog highres enablement
The clocksource watchdog marks a clock as highres capable before it
checked the deviation from the watchdog clocksource even for a single
time. Make sure that the deviation is at least checked once before
doing the switch to highres mode.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134808.627795883@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky f1b82746c1 clocksource: Cleanup clocksource selection
If a non high-resolution clocksource is first set as override clock
and then registered it becomes active even if the system is in one-shot
mode. Move the override check from sysfs_override_clocksource to the
clocksource selection. That fixes the bug and simplifies the code. The
check in clocksource_register for double registration of the same
clocksource is removed without replacement.

To find the initial clocksource a new weak function in jiffies.c is
defined that returns the jiffies clocksource. The architecture code
can then override the weak function with a more suitable clocksource,
e.g. the TOD clock on s390.

[ tglx: Folded in a fix from John Stultz ]

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134808.388024160@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:46 +02:00
Martin Schwidefsky 1be3967948 timekeeping: Move reset of cycle_last for tsc clocksource to tsc
change_clocksource resets the cycle_last value to zero then sets it to
a value read from the clocksource. The reset to zero is required only
for the TSC clocksource to make the read_tsc function work after a
resume. The reason is that the TSC read function uses cycle_last to
detect backwards going TSCs. In the resume case cycle_last contains
the TSC value from the last update before the suspend. On resume the
TSC starts counting from 0 again and would trip over the cycle_last
comparison.

This is subtle and surprising. Move the reset to a resume function in
the tsc code.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134808.142191175@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:45 +02:00
Martin Schwidefsky a0f7d48bfb timekeeping: Remove clocksource inline functions
The three inline functions clocksource_read, clocksource_enable and
clocksource_disable are simple wrappers of an indirect call plus the
copy from and to the mult_orig value. The functions are exclusively
used by the timekeeping code which has intimate knowledge of the
clocksource anyway. Therefore remove the inline functions. No
functional change.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134807.903108946@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:45 +02:00
John Stultz 31089c13bc timekeeping: Introduce timekeeping_leap_insert
Move the adjustment of xtime, wall_to_monotonic and the update of the
vsyscall variables to the timekeeping code.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20090814134807.609730216@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-15 10:55:45 +02:00
Thomas Gleixner 4cd1993f00 Merge branch 'linus' into timers/core
Reason: Martin's timekeeping cleanup series depends on both
timers/core and mainline changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-14 15:59:30 +02:00
Thomas Gleixner 79ef2bb014 clocksource: Prevent NULL pointer dereference
Writing a zero length string to sys/.../current_clocksource will cause
a NULL pointer dereference if the clock events system is in one shot
(highres or nohz) mode.

Pointed-out-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <alpine.DEB.2.00.0907191545580.12306@bicker>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-19 17:15:54 +02:00
Thomas Gleixner 6ff7041dbf hrtimer: Fix migration expiry check
The timer migration expiry check should prevent the migration of a
timer to another CPU when the timer expires before the next event is
scheduled on the other CPU. Migrating the timer might delay it because
we can not reprogram the clock event device on the other CPU. But the
code implementing that check has two flaws:

- for !HIGHRES the check compares the expiry value with the clock
  events device expiry value which is wrong for CLOCK_REALTIME based
  timers.

- the check is racy. It holds the hrtimer base lock of the target CPU,
  but the clock event device expiry value can be modified
  nevertheless, e.g. by an timer interrupt firing.

The !HIGHRES case is easy to fix as we can enqueue the timer on the
cpu which was selected by the load balancer. It runs the idle
balancing code once per jiffy anyway. So the maximum delay for the
timer is the same as when we keep the tick on the current cpu going.

In the HIGHRES case we can get the next expiry value from the hrtimer
cpu_base of the target CPU and serialize the update with the cpu_base
lock. This moves the lock section in hrtimer_interrupt() so we can set
next_event to KTIME_MAX while we are handling the expired timers and
set it to the next expiry value after we handled the timers under the
base lock. While the expired timers are processed timer migration is
blocked because the expiry time of the timer is always <= KTIME_MAX.

Also remove the now useless clockevents_get_next_event() function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-10 17:32:55 +02:00
Thomas Gleixner a40f262cc2 timekeeping: Move ktime_get() functions to timekeeping.c
The ktime_get() functions for GENERIC_TIME=n are still located in
hrtimer.c. Move them to time/timekeeping.c where they belong.

LKML-Reference: <new-submission>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-07 13:00:31 +02:00
Martin Schwidefsky 951ed4d36b timekeeping: optimized ktime_get[_ts] for GENERIC_TIME=y
The generic ktime_get function defined in kernel/hrtimer.c is suboptimial
for GENERIC_TIME=y:

 0)               |  ktime_get() {
 0)               |    ktime_get_ts() {
 0)               |      getnstimeofday() {
 0)               |        read_tod_clock() {
 0)   0.601 us    |        }
 0)   1.938 us    |      }
 0)               |      set_normalized_timespec() {
 0)   0.602 us    |      }
 0)   4.375 us    |    }
 0)   5.523 us    |  }

Overall there are two read_seqbegin/read_seqretry loops and a lot of
unnecessary struct timespec calculations. ktime_get returns a nano second
value which is the sum of xtime, wall_to_monotonic and the nano second
delta from the clock source.

ktime_get can be optimized for GENERIC_TIME=y. The new version only calls
clocksource_read:

 0)               |  ktime_get() {
 0)               |    read_tod_clock() {
 0)   0.610 us    |    }
 0)   1.977 us    |  }

It uses a single read_seqbegin/readseqretry loop and just adds everthing
to a nano second value.

ktime_get_ts is optimized in a similar fashion.

[ tglx: added WARN_ON(timekeeping_suspended) as in getnstimeofday() ]

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090707112728.3005244d@skybase>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-07-07 12:47:33 +02:00
Heiko Carstens 507e123151 timer stats: Optimize by adding quick check to avoid function calls
When the kernel is configured with CONFIG_TIMER_STATS but timer
stats are runtime disabled we still get calls to
__timer_stats_timer_set_start_info which initializes some
fields in the corresponding struct timer_list.

So add some quick checks in the the timer stats setup functions
to avoid function calls to __timer_stats_timer_set_start_info
when timer stats are disabled.

In an artificial workload that does nothing but playing ping
pong with a single tcp packet via loopback this decreases cpu
consumption by 1 - 1.5%.

This is part of a modified function trace output on SLES11:

 perl-2497  [00] 28630647177732388 [+  125]: sk_reset_timer <-tcp_v4_rcv
 perl-2497  [00] 28630647177732513 [+  125]: mod_timer <-sk_reset_timer
 perl-2497  [00] 28630647177732638 [+  125]: __timer_stats_timer_set_start_info <-mod_timer
 perl-2497  [00] 28630647177732763 [+  125]: __mod_timer <-mod_timer
 perl-2497  [00] 28630647177732888 [+  125]: __timer_stats_timer_set_start_info <-__mod_timer
 perl-2497  [00] 28630647177733013 [+   93]: lock_timer_base <-__mod_timer

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mustafa Mesanovic <mustafa.mesanovic@de.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
LKML-Reference: <20090623153811.GA4641@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-24 11:15:09 +02:00
Linus Torvalds 38df92b8ce Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  NOHZ: Properly feed cpufreq ondemand governor
2009-06-20 10:51:44 -07:00
Linus Torvalds 19035e5b5d Merge branch 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers: Logic to move non pinned timers
  timers: /proc/sys sysctl hook to enable timer migration
  timers: Identifying the existing pinned timers
  timers: Framework for identifying pinned timers
  timers: allow deferrable timers for intervals tv2-tv5 to be deferred

Fix up conflicts in kernel/sched.c and kernel/timer.c manually
2009-06-15 10:06:19 -07:00
Linus Torvalds f9db6e0951 Merge branch 'timers-for-linus-clockevents' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-clockevents' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  clockevent: export register_device and delta2ns
  clockevents: tick_broadcast_device can become static
2009-06-15 09:58:50 -07:00
Linus Torvalds 3f27c0d2a4 Merge branch 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  clocksource: prevent selection of low resolution clocksourse also for nohz=on
  clocksource: sanity check sysfs clocksource changes
2009-06-15 09:58:33 -07:00
Thomas Gleixner cd6d95d844 clocksource: prevent selection of low resolution clocksourse also for nohz=on
commit 3f68535ada (clocksource: sanity check sysfs clocksource
changes) prevents selection of non high resolution capable
clocksources when high resolution mode is active, but did not take
into account that the same rules apply for highres=off nohz=on.

Check the tick device mode instead of hrtimer_hres_active() to verify
whether the system needs to be protected from a switch to jiffies or
other non highres capable clock sources.

Reported-by: Luming Yu <luming.yu@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-06-13 12:00:26 +02:00
john stultz 3f68535ada clocksource: sanity check sysfs clocksource changes
Thomas, Andrew and Ingo pointed out that we don't have any safety checks
in the clocksource sysfs entries to make sure sysadmins don't try to
change the clocksource to a non high-res timer capable clocksource (such
as jiffies) when high-res timers (HRT) is enabled.  Doing so will likely
hang a system.

Correct this by filtering non HRT clocksources from available_clocksources
and not accepting non HRT clocksources with HRT enabled.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-06-11 11:24:52 +02:00
Paul Mundt cf9fe114e3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-11 09:01:14 +03:00
Eero Nurkkala f2e21c9610 NOHZ: Properly feed cpufreq ondemand governor
A call from irq_exit() may occasionally pause the timing
info for cpufreq ondemand governor. This results in the
cpufreq ondemand governor to fail to calculate the 
system load properly. Thus, relocate the checks for this
particular case to keep the governor always functional.

Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Reported-by: Tero Kristo <tero.kristo@nokia.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-27 15:33:43 +02:00
Paul Mundt 5f8371cec9 Merge branches 'sh/stable-updates' and 'sh/sparseirq' 2009-05-22 13:29:37 +09:00
Thomas Gleixner dce48a84ad sched, timers: move calc_load() to scheduler
Dimitri Sivanich noticed that xtime_lock is held write locked across
calc_load() which iterates over all online CPUs. That can cause long
latencies for xtime_lock readers on large SMP systems. 

The load average calculation is an rough estimate anyway so there is
no real need to protect the readers vs. the update. It's not a problem
when the avenrun array is updated while a reader copies the values.

Instead of iterating over all online CPUs let the scheduler_tick code
update the number of active tasks shortly before the avenrun update
happens. The avenrun update itself is handled by the CPU which calls
do_timer().

[ Impact: reduce xtime_lock write locked section ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
Arun R Bharadwaj eea08f32ad timers: Logic to move non pinned timers
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

This patch migrates all non pinned timers and hrtimers to the current
idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
are not migrated.

While migrating hrtimers, care should be taken to check if migrating
a hrtimer would result in a latency or not. So we compare the expiry of the
hrtimer with the next timer interrupt on the target cpu and migrate the
hrtimer only if it expires *after* the next interrupt on the target cpu.
So, added a clockevents_get_next_event() helper function to return the
next_event on the target cpu's clock_event_device.

[ tglx: cleanups and simplifications ]

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-13 16:52:42 +02:00
Arun R Bharadwaj 5c333864a6 timers: Identifying the existing pinned timers
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

The following pinned hrtimers have been identified and marked:
1)sched_rt_period_timer
2)tick_sched_timer
3)stack_trace_timer_fn

[ tglx: fixup the hrtimer pinned mode ]

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-13 16:52:42 +02:00
Magnus Damm c81fc2c331 clockevent: export register_device and delta2ns
Export the following symbols using EXPORT_SYMBOL_GPL:
 - clockevent_delta2ns
 - clockevents_register_device

This allows us to build SuperH clockevent and clocksource
drivers as modules, see drivers/clocksource/sh_*.c

[ Impact: allow modular build of clockevent drivers ]

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090501055247.8286.64067.sendpatchset@rx1.opensource.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 11:51:07 +02:00
john stultz 7d27558c41 timekeeping: create arch_gettimeoffset infrastructure
Some arches don't supply their own clocksource. This is mainly the
case in architectures that get their inter-tick times by reading the
counter on their interval timer.  Since these timers wrap every tick,
they're not really useful as clocksources.  Wrapping them to act like
one is possible but not very efficient. So we provide a callout these
arches can implement for use with the jiffies clocksource to provide
finer then tick granular time.

[ Impact: ease the migration to generic time keeping ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 11:45:15 +02:00
Magnus Damm a25cbd045a clocksource: setup mult_orig in clocksource_enable()
Setup clocksource mult_orig in clocksource_enable().

Clocksource drivers can save power by using keeping the
device clock disabled while the clocksource is unused.

In practice this means that the enable() and disable()
callbacks perform clk_enable() and clk_disable().

The enable() callback may also use clk_get_rate() to get
the clock rate from the clock framework. This information
can then be used to calculate the shift and mult variables.

Currently the mult_orig variable is setup from mult at
registration time only. This is conflicting with the above
case since the clock is disabled and the mult variable is
not yet calculated at the time of registration.

Moving the mult_orig setup code to clocksource_enable()
allows us to both handle the common case with no enable()
callback and the mult-changed-after-enable() case.

[ Impact: allow dynamic clock source usage ]

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090501054546.8193.10688.sendpatchset@rx1.opensource.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 11:45:15 +02:00
Dmitri Vorobiev a52f5c5620 clockevents: tick_broadcast_device can become static
The variable tick_broadcast_device is not used outside of the
file where it is defined, so let's make it static.

Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 10:31:14 +02:00
john stultz 74a03b69d1 clockevents: prevent endless loop in tick_handle_periodic()
tick_handle_periodic() can lock up hard when a one shot clock event
device is used in combination with jiffies clocksource.

Avoid an endless loop issue by requiring that a highres valid
clocksource be installed before we call tick_periodic() in a loop when
using ONESHOT mode. The result is we will only increment jiffies once
per interrupt until a continuous hardware clocksource is available.

Without this, we can run into a endless loop, where each cycle through
the loop, jiffies is updated which increments time by tick_period or
more (due to clock steering), which can cause the event programming to
think the next event was before the newly incremented time and fail
causing tick_periodic() to be called again and the whole process loops
forever.

[ Impact: prevent hard lock up ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
2009-05-02 10:22:27 +02:00
Magnus Damm 4614e6adaf clocksource: add enable() and disable() callbacks
Add enable() and disable() callbacks for clocksources.

This allows us to put unused clocksources in power save mode.  The
functions clocksource_enable() and clocksource_disable() wrap the
callbacks and are inserted in the timekeeping code to enable before use
and disable after switching to a new clocksource.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-21 13:41:47 -07:00
Magnus Damm 8e19608e8b clocksource: pass clocksource to read() callback
Pass clocksource pointer to the read() callback for clocksources.  This
allows us to share the callback between multiple instances.

[hugh@veritas.com: fix powerpc build of clocksource pass clocksource mods]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-21 13:41:47 -07:00
Linus Torvalds 6671de344c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
  posix timers: fix RLIMIT_CPU && fork()
  time: ntp: fix bug in ntp_update_offset() & do_adjtimex(), fix
  time: ntp: clean up second_overflow()
  time: ntp: simplify ntp_tick_adj calculations
  time: ntp: make 64-bit constants more robust
  time: ntp: refactor do_adjtimex() some more
  time: ntp: refactor do_adjtimex()
  time: ntp: fix bug in ntp_update_offset() & do_adjtimex()
  time: ntp: micro-optimize ntp_update_offset()
  time: ntp: simplify ntp_update_offset_fll()
  time: ntp: refactor and clean up ntp_update_offset()
  time: ntp: refactor up ntp_update_frequency()
  time: ntp: clean up ntp_update_frequency()
  time: ntp: simplify the MAX_TICKADJ_SCALED definition
  time: ntp: simplify the second_overflow() code flow
  time: ntp: clean up kernel/time/ntp.c
  x86: hpet: stop HPET_COUNTER when programming periodic mode
  x86: hpet: provide separate functions to stop and start the counter
  x86: hpet: print HPET registers during setup (if hpet=verbose is used)
  time: apply NTP frequency/tick changes immediately
  ...
2009-03-26 16:05:42 -07:00
Ingo Molnar 7c526e1fef Merge branches 'timers/new-apis', 'timers/ntp' and 'timers/urgent' into timers/core 2009-03-26 15:45:52 +01:00
John Stultz a2a5ac8650 time: ntp: fix bug in ntp_update_offset() & do_adjtimex(), fix
The time_status conditional was accidentally placed right after we clear
the checked time_status bits, which causes us to take the conditional
every time through. This fixes it by moving the conditional to before we
clear the time_status bits.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Clark Williams <williams@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 19:39:47 +01:00
Ingo Molnar 39854fe8c1 time: ntp: clean up second_overflow()
Impact: cleanup, no functionality changed

The 'time_adj' local variable is named in a very confusing
way because it almost shadows the 'time_adjust' global
variable - which is used in this same function.

Rename it to 'delta' - to make them stand apart more clearly.

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2545	    114	    144	   2803	    af3	ntp.o.before
   2545	    114	    144	   2803	    af3	ntp.o.after

md5:
   1bf0b3be564512279ba7cee299d1d2be  ntp.o.before.asm
   1bf0b3be564512279ba7cee299d1d2be  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:17 +01:00
Ingo Molnar 069569e025 time: ntp: simplify ntp_tick_adj calculations
Impact: micro-optimization

Convert the (internal) ntp_tick_adj value we store from unscaled
units to scaled units. This is a constant that we never modify,
so scaling it up once during bootup is enough - we dont have to
do it for every adjustment step.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:16 +01:00
Ingo Molnar 2b9d1496e7 time: ntp: make 64-bit constants more robust
Impact: cleanup, no functionality changed

 - make PPM_SCALE an explicit s64 constant, to
   remove (s64) casts from usage sites.

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2536	    114	    136	   2786	    ae2	ntp.o.before
   2536	    114	    136	   2786	    ae2	ntp.o.after

md5:
   40a7728d1188aa18e83e21a81fa7b150  ntp.o.before.asm
   40a7728d1188aa18e83e21a81fa7b150  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:15 +01:00
Ingo Molnar e96291653b time: ntp: refactor do_adjtimex() some more
Impact: cleanup, no functionality changed

Further simplify do_adjtimex():

 - introduce the ntp_start_leap_timer() helper function
 - eliminate the goto adj_done complication

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:15 +01:00
Ingo Molnar 80f2257116 time: ntp: refactor do_adjtimex()
Impact: cleanup, no functionality changed

do_adjtimex() is currently a monster function with a maze of
branches. Refactor the txc->modes setting aspects of it into
two new helper functions:

	process_adj_status()
	process_adjtimex_modes()

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2512	    114	    136	   2762	    aca	ntp.o.before
   2512	    114	    136	   2762	    aca	ntp.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:14 +01:00
Ingo Molnar 10dd31a7a1 time: ntp: fix bug in ntp_update_offset() & do_adjtimex()
Impact: change (fix) the way the NTP PLL seconds offset is initialized/tracked

Fix a bug and do a micro-optimization:

When PLL is enabled we do not reset time_reftime. If the PLL
was off for a long time (for example after bootup), this is
arguably the wrong thing to do.

We already had a hack for the common boot-time case in
ntp_update_offset(), in form of:

	if (unlikely(time_status & STA_FREQHOLD || time_reftime == 0))
 		secs = 0;

But the update delta should be reset later on too - not just when
the PLL is enabled for the first time after bootup.

So do it on !STA_PLL -> STA_PLL transitions.

This changes behavior, as previously if ntpd was disabled for
a long time and we restarted it, we'd run from that last update,
with a very large delta.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:13 +01:00
Ingo Molnar c7986acba2 time: ntp: micro-optimize ntp_update_offset()
Impact: cleanup, no functionality changed

The time_reftime update in ntp_update_offset() to xtime.tv_sec
is a convoluted way of saying that we want to freeze the frequency
and want the 'secs' delta to be 0. Also make this branch unlikely.

This shaves off 8 bytes from the code size:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2496	    114	    136	   2746	    aba	ntp.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:12 +01:00
Ingo Molnar 478b7aab16 time: ntp: simplify ntp_update_offset_fll()
Impact: cleanup, no functionality changed

Change ntp_update_offset_fll() to delta logic instead of
absolute value logic. This eliminates 'freq_adj' from the
function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:12 +01:00
Ingo Molnar f939890b66 time: ntp: refactor and clean up ntp_update_offset()
Impact: cleanup, no functionality changed

- introduce the ntp_update_offset_fll() helper
- clean up the flow and variable naming

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2504	    114	    136	   2754	    ac2	ntp.o.after

md5:
   01f7b8e1a5472a3056f9e4ae84d46315  ntp.o.before.asm
   01f7b8e1a5472a3056f9e4ae84d46315  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:11 +01:00
Ingo Molnar bc26c31d44 time: ntp: refactor up ntp_update_frequency()
Impact: cleanup, no functionality changed

Change ntp_update_frequency() from a hard to follow code
flow that uses global variables as temporaries, to a clean
input+output flow.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:10 +01:00
Ingo Molnar 9ce616aaef time: ntp: clean up ntp_update_frequency()
Impact: cleanup, no functionality changed

Prepare a refactoring of ntp_update_frequency().

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2504	    114	    136	   2754	    ac2	ntp.o.after

md5:
   41f3009debc9b397d7394dd77d912f0a  ntp.o.before.asm
   41f3009debc9b397d7394dd77d912f0a  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:09 +01:00
Ingo Molnar bbd1267690 time: ntp: simplify the MAX_TICKADJ_SCALED definition
Impact: cleanup, no functionality changed

There's an ugly u64 typecase in the MAX_TICKADJ_SCALED definition,
this can be eliminated by making the MAX_TICKADJ constant's type
64-bit (signed).

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2504	    114	    136	   2754	    ac2	ntp.o.after

md5:
   41f3009debc9b397d7394dd77d912f0a  ntp.o.before.asm
   41f3009debc9b397d7394dd77d912f0a  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:08 +01:00
Ingo Molnar 3c972c2444 time: ntp: simplify the second_overflow() code flow
Impact: cleanup, no functionality changed

Instead of a hierarchy of conditions, transform them to clean
gradual conditions and return's.

This makes the flow easier to read and makes the purpose of
the function easier to understand.

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2552	    170	    168	   2890	    b4a	ntp.o.before
   2552	    170	    168	   2890	    b4a	ntp.o.after

md5:
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.before.asm
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:07 +01:00
Ingo Molnar 53bbfa9e94 time: ntp: clean up kernel/time/ntp.c
Impact: cleanup, no functionality changed

Make this file a bit more readable by applying a consistent coding style.

No code changed:

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2552	    170	    168	   2890	    b4a	ntp.o.before
   2552	    170	    168	   2890	    b4a	ntp.o.after

md5:
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.before.asm
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:06 +01:00
john stultz fdcedf7b75 time: apply NTP frequency/tick changes immediately
Since the GENERIC_TIME changes landed, the adjtimex behavior changed
for struct timex.tick and .freq changed. When the tick or freq value
is set, we adjust the tick_length_base in ntp_update_frequency().
However, this new value doesn't get applied to tick_length until the
next second (via second_overflow).

This means some applications that do quick time tweaking do not see the
requested change made as quickly as expected.

I've run a few tests with this change, and ntpd still functions fine.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 10:10:08 +01:00
Patrick Ohly a75244c3d5 timecompare: generic infrastructure to map between two time bases
Mapping from a struct timecounter to a time returned by functions like
ktime_get_real() is implemented. This is sufficient to use this code
in a network device driver which wants to support hardware time
stamping and transformation of hardware time stamps to system time.

The interface could have been made more versatile by not depending on
a time counter, but this wasn't done to avoid writing glue code
elsewhere.

The method implemented here is the one used and analyzed under the name
"assisted PTP" in the LCI PTP paper:
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf

Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:32 -08:00
Patrick Ohly a038a353c3 clocksource: allow usage independent of timekeeping.c
So far struct clocksource acted as the interface between time/timekeeping.c
and hardware. This patch generalizes the concept so that a similar
interface can also be used in other contexts. For that it introduces
new structures and related functions *without* touching the existing
struct clocksource.

The reasons for adding these new structures to clocksource.[ch] are
* the APIs are clearly related
* struct clocksource could be cleaned up to use the new structs
* avoids proliferation of files with similar names (timesource.h?
  timecounter.h?)

As outlined in the discussion with John Stultz, this patch adds
* struct cyclecounter: stateless API to hardware which counts clock cycles
* struct timecounter: stateful utility code built on a cyclecounter which
  provides a nanosecond counter
* only the function to read the nanosecond counter; deltas are used internally
  and not exposed to users of timecounter

The code does no locking of the shared state. It must be called at least
as often as the cycle counter wraps around to detect these wrap arounds.
Both is the responsibility of the timecounter user.

Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:31 -08:00
Ingo Molnar 5ba1ae92b6 Merge branches 'timers/clockevents', 'timers/hpet', 'timers/hrtimers' and 'timers/urgent' into timers/core 2009-02-08 20:14:11 +01:00
Sebastien Dugue 94df7de028 hrtimers: allow the hot-unplugging of all cpus
Impact: fix CPU hotplug hang on Power6 testbox

On architectures that support offlining all cpus (at least powerpc/pseries),
hot-unpluging the tick_do_timer_cpu can result in a system hang.

This comes from the fact that if the cpu going down happens to be the
cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
cpu is dead (via the CPU_DEAD notification), we're left without ticks,
jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
That's particularly the case for the cpu looping in __cpu_die() waiting
for the dying cpu to be dead.

This patch addresses this by having the tick_do_timer_cpu handover happen
earlier during the CPU_DYING notification. For this, a new clockevent
notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
in hrtimer_cpu_notify().

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30 22:35:29 +01:00
Magnus Damm 2d68259db2 clockevents: let set_mode() setup delta information
Allow the set_mode() clockevent callback to decide and fill in delta
details such as shift, mult, max_delta_ns and min_delta_ns.

With this change the clockevent can be registered without delta details
which allows us to keep the parent clock disabled until the clockevent
gets setup using set_mode().

Letting set_mode() fill in or update delta details allows us to save
power by disabling the parent clock while the clockevent is unused.
This may however make the parent clock rate change, so next time the
clockevent gets enabled we need let set_mode() to update the detla
details accordingly. Doing it at registration time is not enough.

Furthermore, the delta details seem unused in the case of periodic-only
clockevent drivers, so this change also allows registration of such
drivers without the delta details filled in.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-01-16 12:27:39 +01:00
Jaswinder Singh Rajput 934d96eafa time-sched.c: tick_nohz_update_jiffies should be static
Impact: cleanup, reduce kernel size a bit, avoid sparse warning

Fixes sparse warning:

 kernel/time/tick-sched.c:137:6: warning: symbol 'tick_nohz_update_jiffies' was not declared. Should it be static?

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 12:06:56 +01:00
Ingo Molnar e3ee1e1231 Merge commit 'v2.6.29-rc1' into timers/hrtimers
Conflicts:
	kernel/time/tick-common.c
2009-01-12 11:32:03 +01:00
Linus Torvalds 57c44c5f6f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
  trivial: chack -> check typo fix in main Makefile
  trivial: Add a space (and a comma) to a printk in 8250 driver
  trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
  trivial: Fix misspelling of "firmware" in powerpc Makefile
  trivial: Fix misspelling of "firmware" in usb.c
  trivial: Fix misspelling of "firmware" in qla1280.c
  trivial: Fix misspelling of "firmware" in a100u2w.c
  trivial: Fix misspelling of "firmware" in megaraid.c
  trivial: Fix misspelling of "firmware" in ql4_mbx.c
  trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
  trivial: Fix misspelling of "firmware" in ipw2100.c
  trivial: Fix misspelling of "firmware" in atmel.c
  trivial: Fix misspelled firmware in Kconfig
  trivial: fix an -> a typos in documentation and comments
  trivial: fix then -> than typos in comments and documentation
  trivial: update Jesper Juhl CREDITS entry with new email
  trivial: fix singal -> signal typo
  trivial: Fix incorrect use of "loose" in event.c
  trivial: printk: fix indentation of new_text_line declaration
  trivial: rtc-stk17ta8: fix sparse warning
  ...
2009-01-07 11:31:52 -08:00
Frederik Schwarzer 025dfdafe7 trivial: fix then -> than typos in comments and documentation
- (better, more, bigger ...) then -> (...) than

Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-01-06 11:28:06 +01:00
Ingo Molnar d9be28ea91 Merge branches 'sched/clock', 'sched/cleanups' and 'linus' into sched/urgent 2009-01-06 09:33:57 +01:00
Linus Torvalds 7d3b56ba37 Merge branch 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
  x86: setup_per_cpu_areas() cleanup
  cpumask: fix compile error when CONFIG_NR_CPUS is not defined
  cpumask: use alloc_cpumask_var_node where appropriate
  cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
  x86: use cpumask_var_t in acpi/boot.c
  x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
  sched: put back some stack hog changes that were undone in kernel/sched.c
  x86: enable cpus display of kernel_max and offlined cpus
  ia64: cpumask fix for is_affinity_mask_valid()
  cpumask: convert RCU implementations, fix
  xtensa: define __fls
  mn10300: define __fls
  m32r: define __fls
  h8300: define __fls
  frv: define __fls
  cris: define __fls
  cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
  cpumask: zero extra bits in alloc_cpumask_var_node
  cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
  cpumask: convert mm/
  ...
2009-01-03 12:04:39 -08:00
Linus Torvalds 61420f59a5 Merge branch 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [PATCH] fast vdso implementation for CLOCK_THREAD_CPUTIME_ID
  [PATCH] improve idle cputime accounting
  [PATCH] improve precision of idle time detection.
  [PATCH] improve precision of process accounting.
  [PATCH] idle cputime accounting
  [PATCH] fix scaled & unscaled cputime accounting
2009-01-03 11:56:24 -08:00
Mike Travis 7eb1955336 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask into merge-rr-cpumask
Conflicts:
	arch/x86/kernel/io_apic.c
	kernel/rcuclassic.c
	kernel/sched.c
	kernel/time/tick-sched.c

Signed-off-by: Mike Travis <travis@sgi.com>
[ mingo@elte.hu: backmerged typo fix for io_apic.c ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-03 18:53:31 +01:00
Linus Torvalds b840d79631 Merge branch 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
  x86: export vector_used_by_percpu_irq
  x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
  sched: nominate preferred wakeup cpu, fix
  x86: fix lguest used_vectors breakage, -v2
  x86: fix warning in arch/x86/kernel/io_apic.c
  sched: fix warning in kernel/sched.c
  sched: move test_sd_parent() to an SMP section of sched.h
  sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
  sched: activate active load balancing in new idle cpus
  sched: bias task wakeups to preferred semi-idle packages
  sched: nominate preferred wakeup cpu
  sched: favour lower logical cpu number for sched_mc balance
  sched: framework for sched_mc/smt_power_savings=N
  sched: convert BALANCE_FOR_xx_POWER to inline functions
  x86: use possible_cpus=NUM to extend the possible cpus allowed
  x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
  x86: update io_apic.c to the new cpumask code
  x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
  x86: xen: use smp_call_function_many()
  x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
  ...

Fixed up trivial conflict in kernel/time/tick-sched.c manually
2009-01-02 11:44:09 -08:00
Rusty Russell 5db0e1e9e0 cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
Impact: cleanup

Simple replacement, now the _nr is redundant.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
2009-01-01 10:12:29 +10:30
Rusty Russell 6b954823c2 cpumask: convert kernel time functions
Impact: Use new APIs

Convert kernel/time functions to use struct cpumask *.

Note the ugly bitmap declarations in tick-broadcast.c.  These should
be cpumask_var_t, but there was no obvious initialization function to
put the alloc_cpumask_var() calls in.  This was safe.

(Eventually 'struct cpumask' will be undefined for CONFIG_CPUMASK_OFFSTACK,
so we use a bitmap here to show we really mean it).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-01 10:12:25 +10:30
Martin Schwidefsky 79741dd357 [PATCH] idle cputime accounting
The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_idle_time and the second with account_system_time.
The architectures that use the account_xxx_time interface directly and not
the account_xxx_ticks interface now need to do the check for the idle
process in their arch code. In particular to improve the system vs true
idle time accounting the arch code needs to measure the true idle time
instead of just testing for the idle process.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.

In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.

This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2008-12-31 15:11:46 +01:00
Martin Schwidefsky 457533a7d3 [PATCH] fix scaled & unscaled cputime accounting
The utimescaled / stimescaled fields in the task structure and the
global cpustat should be set on all architectures. On s390 the calls
to account_user_time_scaled and account_system_time_scaled never have
been added. In addition system time that is accounted as guest time
to the user time of a process is accounted to the scaled system time
instead of the scaled user time.
To fix the bugs and to prevent future forgetfulness this patch merges
account_system_time_scaled into account_system_time and
account_user_time_scaled into account_user_time.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Michael Neuling <mikey@neuling.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2008-12-31 15:11:46 +01:00
Rusty Russell 2ca1a61583 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	arch/x86/kernel/io_apic.c
2008-12-31 23:05:57 +10:30
Thomas Gleixner 1c5745aa38 sched_clock: prevent scd->clock from moving backwards, take #2
Redo:

  5b7dba4: sched_clock: prevent scd->clock from moving backwards

which had to be reverted due to s2ram hangs:

  ca7e716: Revert "sched_clock: prevent scd->clock from moving backwards"

... this time with resume restoring GTOD later in the sequence
taken into account as well.

The "timekeeping_suspended" flag is not very nice but we cannot call into
GTOD before it has been properly resumed and the scheduler will run very
early in the resume sequence.

Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-31 09:53:21 +01:00
Sebastien Dugue 5762ba1873 hrtimers: allow the hot-unplugging of all cpus
Impact: fix CPU hotplug hang on Power6 testbox

On architectures that support offlining all cpus (at least powerpc/pseries),
hot-unpluging the tick_do_timer_cpu can result in a system hang.

This comes from the fact that if the cpu going down happens to be the
cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
cpu is dead (via the CPU_DEAD notification), we're left without ticks,
jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
That's particularly the case for the cpu looping in __cpu_die() waiting
for the dying cpu to be dead.

This patch addresses this by having the tick_do_timer_cpu handover happen
earlier during the CPU_DYING notification. For this, a new clockevent
notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
in hrtimer_cpu_notify().

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-30 07:37:19 +01:00
Ingo Molnar 32e8d18683 Merge branches 'timers/clocksource', 'timers/hpet', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/rtc' into timers/core 2008-12-25 18:02:25 +01:00
Rusty Russell 968ea6d80e Merge ../linux-2.6-x86
Conflicts:

	arch/x86/kernel/io_apic.c
	kernel/sched.c
	kernel/sched_stats.h
2008-12-13 21:55:51 +10:30
Rusty Russell 320ab2b0b1 cpumask: convert struct clock_event_device to cpumask pointers.
Impact: change calling convention of existing clock_event APIs

struct clock_event_timer's cpumask field gets changed to take pointer,
as does the ->broadcast function.

Another single-patch change.  For safety, we BUG_ON() in
clockevents_register_device() if it's not set.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
2008-12-13 21:20:26 +10:30
Rusty Russell 0de26520c7 cpumask: make irq_set_affinity() take a const struct cpumask
Impact: change existing irq_chip API

Not much point with gentle transition here: the struct irq_chip's
setaffinity method signature needs to change.

Fortunately, not widely used code, but hits a few architectures.

Note: In irq_select_affinity() I save a temporary in by mangling
irq_desc[irq].affinity directly.  Ingo, does this break anything?

(Folded in fix from KOSAKI Motohiro)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: jeremy@xensource.com
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
2008-12-13 21:20:26 +10:30
Woodruff, Richard 001474491f nohz: suppress needless timer reprogramming
In my device I get many interrupts from a high speed USB device in a very
short period of time.  The system spends a lot of time reprogramming the
hardware timer which is in a slower timing domain as compared to the CPU. 
This results in the CPU spending a huge amount of time waiting for the
timer posting to be done.  All of this reprogramming is useless as the
wake up time has not changed.

As measured using ETM trace this drops my reprogramming penalty from
almost 60% CPU load down to 15% during high interrupt rate.  I can send
traces to show this.

Suppress setting of duplicate timer event when timer already stopped. 
Timer programming can be very costly and can result in long cpu stall/wait
times.

[akpm@linux-foundation.org: coding-style fixes]
[tglx@linutronix.de: move the check to the right place and avoid raising
		     the softirq for nothing]

Signed-off-by: Richard Woodruff <r-woodruff2@ti.com>
Cc: johnstul@us.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-12-12 16:55:31 +01:00
Ingo Molnar 45ab6b0c76 Merge branch 'sched/core' into cpus4096
Conflicts:
	include/linux/ftrace.h
	kernel/sched.c
2008-12-12 13:48:57 +01:00
Heiko Carstens fa116ea35e nohz: no softirq pending warnings for offline cpus
Impact: remove false positive warning

After a cpu was taken down during cpu hotplug (read: disabled for interrupts)
it still might have pending softirqs. However take_cpu_down makes sure
that the idle task will run next instead of ksoftirqd on the taken down cpu.
The idle task will call tick_nohz_stop_sched_tick which might warn about
pending softirqs just before the cpu kills itself completely.

However the pending softirqs on the dead cpu aren't a problem because they
will be moved to an online cpu during CPU_DEAD handling.

So make sure we warn only for online cpus.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 07:27:01 +01:00
john stultz 6c9bacb41c time: catch xtime_nsec underflows and fix them
Impact: fix time warp bug

Alex Shi, along with Yanmin Zhang have been noticing occasional time
inconsistencies recently. Through their great diagnosis, they found that
the xtime_nsec value used in update_wall_time was occasionally going
negative. After looking through the code for awhile, I realized we have
the possibility for an underflow when three conditions are met in
update_wall_time():

  1) We have accumulated a second's worth of nanoseconds, so we
     incremented xtime.tv_sec and appropriately decrement xtime_nsec.
     (This doesn't cause xtime_nsec to go negative, but it can cause it
      to be small).

  2) The remaining offset value is large, but just slightly less then
     cycle_interval.

  3) clocksource_adjust() is speeding up the clock, causing a
     corrective amount (compensating for the increase in the multiplier
     being multiplied against the unaccumulated offset value) to be
     subtracted from xtime_nsec.

This can cause xtime_nsec to underflow.

Unfortunately, since we notify the NTP subsystem via second_overflow()
whenever we accumulate a full second, and this effects the error
accumulation that has already occured, we cannot simply revert the
accumulated second from xtime nor move the second accumulation to after
the clocksource_adjust call without a change in behavior.

This leaves us with (at least) two options:

1) Simply return from clocksource_adjust() without making a change if we
   notice the adjustment would cause xtime_nsec to go negative.

This would work, but I'm concerned that if a large adjustment was needed
(due to the error being large), it may be possible to get stuck with an
ever increasing error that becomes too large to correct (since it may
always force xtime_nsec negative). This may just be paranoia on my part.

2) Catch xtime_nsec if it is negative, then add back the amount its
   negative to both xtime_nsec and the error.

This second method is consistent with how we've handled earlier rounding
issues, and also has the benefit that the error being added is always in
the oposite direction also always equal or smaller then the correction
being applied. So the risk of a corner case where things get out of
control is lessened.

This patch fixes bug 11970, as tested by Yanmin Zhang
http://bugzilla.kernel.org/show_bug.cgi?id=11970

Reported-by: alex.shi@intel.com
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-04 08:43:02 +01:00