Commit Graph

330 Commits

Author SHA1 Message Date
Rafael J. Wysocki 95d6c0857e cpufreq: intel_pstate: Register when ACPI PCCH is present
Currently, intel_pstate doesn't register if _PSS is not present on
HP Proliant systems, because it expects the firmware to take over
CPU performance scaling in that case.  However, if ACPI PCCH is
present, the firmware expects the kernel to use it for CPU
performance scaling and the pcc-cpufreq driver is loaded for that.

Unfortunately, the firmware interface used by that driver is not
scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
on systems with more than just a few CPUs.  In fact, it is better to
avoid using it at all.

For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
is not present and register if it is there.  Also prevent the
pcc-cpufreq driver from trying to initialize itself if intel_pstate
has been registered already.

Fixes: fbbcdc0744 (intel_pstate: skip the driver if ACPI has power mgmt option)
Reported-by: Andreas Herrmann <aherrmann@suse.com>
Reviewed-by: Andreas Herrmann <aherrmann@suse.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Andreas Herrmann <aherrmann@suse.com>
Cc: 4.16+ <stable@vger.kernel.org> # 4.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-07-18 13:38:37 +02:00
Srinivas Pandruvada ff7c991714 cpufreq: intel_pstate: Fix scaling max/min limits with Turbo 3.0
When scaling max/min settings are changed, internally they are converted
to a ratio using the max turbo 1 core turbo frequency. This works fine
when 1 core max is same irrespective of the core. But under Turbo 3.0,
this will not be the case. For example:
Core 0: max turbo pstate: 43 (4.3GHz)
Core 1: max turbo pstate: 45 (4.5GHz)
In this case 1 core turbo ratio will be maximum of all, so it will be
45 (4.5GHz). Suppose scaling max is set to 4GHz (ratio 40) for all cores
,then on core one it will be
 = max_state * policy->max / max_freq;
 = 43 * (4000000/4500000) = 38 (3.8GHz)
 = 38
which is 200MHz less than the desired.
On core2, it will be correctly set to ratio 40 (4GHz). Same holds true
for scaling min frequency limit. So this requires usage of correct turbo
max frequency for core one, which in this case is 4.3GHz. So we need to
adjust per CPU cpu->pstate.turbo_freq using the maximum HWP ratio of that
core.

This change uses the HWP capability of a core to adjust max turbo
frequency. But since Broadwell HWP doesn't use ratios in the HWP
capabilities, we have to use legacy max 1 core turbo ratio. This is not
a problem as the HWP capabilities don't differ among cores in Broadwell.
We need to check for non Broadwell CPU model for applying this change,
though.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-06-19 10:40:29 +02:00
Linus Torvalds d09fcecb0c Additional power management updates for 4.18-rc1
- Revert a recent PM core change that attempted to fix an issue
    related to device links, but introduced a regression (Rafael
    Wysocki).
 
  - Fix build when the recently added cpufreq driver for Kryo
    processors is selected by making it possible to build that
    driver as a module (Arnd Bergmann).
 
  - Fix the long idle detection mechanism in the out-of-band
    (ondemand and conservative) cpufreq governors (Chen Yu).
 
  - Add support for devices in multiple power domains to the
    generic power domains (genpd) framework (Ulf Hansson).
 
  - Add support for iowait boosting on systems with hardware-managed
    P-states (HWP) enabled to the intel_pstate driver and make it use
    that feature on systems with Skylake Xeon processors as it is
    reported to improve performance significantly on those systems
    (Srinivas Pandruvada).
 
  - Fix and update the acpi_cpufreq, ti-cpufreq and imx6q cpufreq
    drivers (Colin Ian King, Suman Anna, Sébastien Szymanski).
 
  - Change the behavior of the wakeup_count device attribute in
    sysfs to expose the number of events when the device might have
    aborted system suspend in progress (Ravi Chandra Sadineni).
 
  - Fix two minor issues in the cpupower utility (Abhishek Goel,
    Colin Ian King).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJbIOf+AAoJEILEb/54YlRxk5EQAIyLpvR0zdp2gMaMl3rbWqtM
 W6XpJbLzL4be9zHKDj4bycO6nbevPOr5oXgm3DQUaUvkLo86cUl2NJlNAv789UZR
 NQ8L51WiY4hG4WDrBQntEBw7TDBUDuo6TEa2/0WJQQhj6WQP821oehmF4G+N9A9h
 z9YhwbWNgivulyNy09nAcVgJ39cxUVWb9EmTXthp0KnyJzn8de+V3MxlEwJTAmHc
 jma9PEil9Key2rS8LRr+djvwa6tYKydOCjkA+o6m7Fo1IVaaVydDgciG4tjnsHNV
 wtEfbOZnisnkYrNEbViqQhhnsvSLkTtfAku58Ove5Kz2GPSPjyIoRrK7FUfDetr+
 ZQLWq6TPzR9u2m3kQfhHB6C463bGxd4s2BntPH2RLHbs82FENEtGkHdxQOv5B1tW
 Gvl9gF9ZDov6gL3jftNdhIz4rQVGaXQlY5/q+alV1I3jhyg7zddht4oh+nNt41XR
 ysszEg9K62w/QAuqZeUsHaR7pPoZZDQzr3TRkKX0uvl88jq4HUPj+aKqNYxq0IrZ
 uYd92gqvD7HH1UKRPqjvZ65Uj5WTbn7picAYJhTlQR4b73X0j66xDSZp/IZVpbEc
 ierDftBxdwklnfxrpy19yJKgIDB89zLP0IX+3BacEC+BWguI//MOb5X0EEpcf/WK
 eyG13J1wTF1qLzKDdur9
 =VROk
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These revert a recent PM core change that introduced a regression, fix
  the build when the recently added Kryo cpufreq driver is selected, add
  support for devices attached to multiple power domains to the generic
  power domains (genpd) framework, add support for iowait boosting on
  systens with hardware-managed P-states (HWP) enabled to the
  intel_pstate driver, modify the behavior of the wakeup_count device
  attribute in sysfs, fix a few issues and clean up some ugliness,
  mostly in cpufreq (core and drivers) and in the cpupower utility.

  Specifics:

   - Revert a recent PM core change that attempted to fix an issue
     related to device links, but introduced a regression (Rafael
     Wysocki)

   - Fix build when the recently added cpufreq driver for Kryo
     processors is selected by making it possible to build that driver
     as a module (Arnd Bergmann)

   - Fix the long idle detection mechanism in the out-of-band (ondemand
     and conservative) cpufreq governors (Chen Yu)

   - Add support for devices in multiple power domains to the generic
     power domains (genpd) framework (Ulf Hansson)

   - Add support for iowait boosting on systems with hardware-managed
     P-states (HWP) enabled to the intel_pstate driver and make it use
     that feature on systems with Skylake Xeon processors as it is
     reported to improve performance significantly on those systems
     (Srinivas Pandruvada)

   - Fix and update the acpi_cpufreq, ti-cpufreq and imx6q cpufreq
     drivers (Colin Ian King, Suman Anna, Sébastien Szymanski)

   - Change the behavior of the wakeup_count device attribute in sysfs
     to expose the number of events when the device might have aborted
     system suspend in progress (Ravi Chandra Sadineni)

   - Fix two minor issues in the cpupower utility (Abhishek Goel, Colin
     Ian King)"

* tag 'pm-4.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "PM / runtime: Fixup reference counting of device link suppliers at probe"
  cpufreq: imx6q: check speed grades for i.MX6ULL
  cpufreq: governors: Fix long idle detection logic in load calculation
  cpufreq: intel_pstate: enable boost for Skylake Xeon
  PM / wakeup: Export wakeup_count instead of event_count via sysfs
  PM / Domains: Add dev_pm_domain_attach_by_id() to manage multi PM domains
  PM / Domains: Add support for multi PM domains per device to genpd
  PM / Domains: Split genpd_dev_pm_attach()
  PM / Domains: Don't attach devices in genpd with multi PM domains
  PM / Domains: dt: Allow power-domain property to be a list of specifiers
  cpufreq: intel_pstate: New sysfs entry to control HWP boost
  cpufreq: intel_pstate: HWP boost performance on IO wakeup
  cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
  cpufreq: ti-cpufreq: Use devres managed API in probe()
  cpufreq: ti-cpufreq: Fix an incorrect error return value
  cpufreq: ACPI: make function acpi_cpufreq_fast_switch() static
  cpufreq: kryo: allow building as a loadable module
  cpupower : Fix header name to read idle state name
  cpupower: fix spelling mistake: "logilename" -> "logfilename"
2018-06-13 07:24:18 -07:00
Kees Cook fad953ce0b treewide: Use array_size() in vzalloc()
The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vzalloc(a * b)

with:
        vzalloc(array_size(a, b))

as well as handling cases of:

        vzalloc(a * b * c)

with:

        vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vzalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vzalloc(C1 * C2 * C3, ...)
|
  vzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vzalloc(C1 * C2, ...)
|
  vzalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Srinivas Pandruvada 41ab43c9c8 cpufreq: intel_pstate: enable boost for Skylake Xeon
Enable HWP boost on Skylake server and workstations.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-06-08 11:10:26 +02:00
Srinivas Pandruvada aaaece3de9 cpufreq: intel_pstate: New sysfs entry to control HWP boost
A new attribute is added to intel_pstate sysfs to enable/disable
HWP dynamic performance boost.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-06-06 08:37:52 +02:00
Srinivas Pandruvada 52ccc43142 cpufreq: intel_pstate: HWP boost performance on IO wakeup
This change uses SCHED_CPUFREQ_IOWAIT flag to boost HWP performance.
Since SCHED_CPUFREQ_IOWAIT flag is set frequently, we don't start
boosting steps unless we see two consecutive flags in two ticks. This
avoids boosting due to IO because of regular system activities.

To avoid synchronization issues, the actual processing of the flag is
done on the local CPU callback.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-06-06 08:37:52 +02:00
Srinivas Pandruvada e0efd5be63 cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
Added two utility functions to HWP boost up gradually and boost down to
the default cached HWP request values.

Boost up:
Boost up updates HWP request minimum value in steps. This minimum value
can reach upto at HWP request maximum values depends on how frequently,
this boost up function is called. At max, boost up will take three steps
to reach the maximum, depending on the current HWP request levels and HWP
capabilities. For example, if the current settings are:
If P0 (Turbo max) = P1 (Guaranteed max) = min
        No boost at all.
If P0 (Turbo max) > P1 (Guaranteed max) = min
        Should result in one level boost only for P0.
If P0 (Turbo max) = P1 (Guaranteed max) > min
        Should result in two level boost:
                (min + p1)/2 and P1.
If P0 (Turbo max) > P1 (Guaranteed max) > min
        Should result in three level boost:
                (min + p1)/2, P1 and P0.
We don't set any level between P0 and P1 as there is no guarantee that
they will be honored.

Boost down:
After the system is idle for hold time of 3ms, the HWP request is reset
to the default value from HWP init or user modified one via sysfs.

Caching of HWP Request and Capabilities
Store the HWP request value last set using MSR_HWP_REQUEST and read
MSR_HWP_CAPABILITIES. This avoid reading of MSRs in the boost utility
functions.

These boost utility functions calculated limits are based on the latest
HWP request value, which can be modified by setpolicy() callback. So if
user space modifies the minimum perf value, that will be accounted for
every time the boost up is called. There will be case when there can be
contention with the user modified minimum perf, in that case user value
will gain precedence. For example just before HWP_REQUEST MSR is updated
from setpolicy() callback, the boost up function is called via scheduler
tick callback. Here the cached MSR value is already the latest and limits
are updated based on the latest user limits, but on return the MSR write
callback called from setpolicy() callback will update the HWP_REQUEST
value. This will be used till next time the boost up function is called.

In addition add a variable to control HWP dynamic boosting. When HWP
dynamic boost is active then set the HWP specific update util hook. The
contents in the utility hooks will be filled in the subsequent patches.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-06-06 08:37:52 +02:00
Doug Smythies 50e9ffaba5 cpufreq: intel_pstate: allow trace in passive mode
Allow use of the trace_pstate_sample trace function
when the intel_pstate driver is in passive mode.
Since the core_busy and scaled_busy fields are not
used, and it might be desirable to know which path
through the driver was used, either intel_cpufreq_target
or intel_cpufreq_fast_switch, re-task the core_busy
field as a flag indicator.

The user can then use the intel_pstate_tracer.py utility
to summarize and plot the trace.

Note: The core_busy feild still goes by that name
in include/trace/events/power.h and within the
intel_pstate_tracer.py script and csv file headers,
but it is graphed as "performance", and called
core_avg_perf now in the intel_pstate driver.

Sometimes, in passive mode, the driver is not called for
many tens or even hundreds of seconds. The user
needs to understand, and not be confused by, this limitation.

Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-14 23:34:25 +02:00
Rafael J. Wysocki b258dfeafb cpufreq: intel_pstate: Do not include debugfs.h
The intel_pstate driver doesn't use debugfs any more, so drop
linux/debugfs.h from the list of included headers in it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2018-04-10 08:38:02 +02:00
Chen Yu 70f6bf2a3b cpufreq: intel_pstate: Enable HWP during system resume on CPU0
When maxcpus=1 is in the kernel command line, the BP is responsible
for re-enabling the HWP - because currently only the APs invoke
intel_pstate_hwp_enable() during their online process - which might
put the system into unstable state after resume.

Fix this by enabling the HWP explicitly on BP during resume.

Reported-by: Doug Smythies <dsmythies@telus.net>
Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Yu Chen <yu.c.chen@intel.com>
[ rjw: Subject/changelog, minor modifications ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-02-08 10:21:38 +01:00
Srinivas Pandruvada d8de7a44e1 cpufreq: intel_pstate: Add Skylake servers support
Currently intel_pstate can function only in HWP mode on Skylake servers.
When HWP feature is not enabled on the processor then acpi-cpufreq is
driver is used.

Based on the power and performance tests using intel_pstate scaling
algorithm the results are comparable. But intel_pstate brings in
additional features:
 - Display of turbo frequency range, which many users like to see
 - Place limits in the turbo frequency range when platform allows

Since these tests are done only using non PID algorithm introduced in
kernel version 4.14, this patch is not a backport candidate. So each user
has to carefully weigh the benefits before he backports.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-11 18:57:20 +01:00
Srinivas Pandruvada dbd49b85ee cpufreq: intel_pstate: Replace bxt_funcs with core_funcs
Since core_funcs and bxt_funcs have same set of callbacks, replace
bxt_funcs with core_funcs.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-11 18:57:20 +01:00
Linus Torvalds 53ac64aac9 ACPI updates for v4.14-rc1
- Update the ACPICA code in the kernel to upstream revision 20170728
    including:
    * Alias operator handling update (Bob Moore).
    * Deferred resolution of reference package elements (Bob Moore).
    * Support for the _DMA method in walk resources (Bob Moore).
    * Tables handling update and support for deferred table
      verification (Lv Zheng).
    * Update of SMMU models for IORT (Robin Murphy).
    * Compiler and disassembler updates (Alex James, Erik Schmauss,
      Ganapatrao Kulkarni, James Morse).
    * Tools updates (Erik Schmauss, Lv Zheng).
    * Assorted minor fixes and cleanups (Bob Moore, Kees Cook,
      Lv Zheng, Shao Ming).
 
  - Rework the initialization of non-wakeup GPEs with method handlers
    in order to address a boot crash on some systems with Thunderbolt
    devices connected at boot time where we miss an early hotplug
    event due to a delay in GPE enabling (Rafael Wysocki).
 
  - Rework the handling of PCI bridges when setting up ACPI-based
    device wakeup in order to avoid disabling wakeup for bridges
    prematurely (Rafael Wysocki).
 
  - Consolidate Apple DMI checks throughout the tree, add support for
    Apple device properties to the device properties framework and
    use these properties for the handling of I2C and SPI devices on
    Apple systems (Lukas Wunner).
 
  - Add support for _DMA to the ACPI-based device properties lookup
    code and make it possible to use the information from there to
    configure DMA regions on ARM64 systems (Lorenzo Pieralisi).
 
  - Fix several issues in the APEI code, add support for exporting
    the BERT error region over sysfs and update APEI MAINTAINERS
    entry with reviewers information (Borislav Petkov, Dongjiu Geng,
    Loc Ho, Punit Agrawal, Tony Luck, Yazen Ghannam).
 
  - Fix a potential initialization ordering issue in the ACPI EC
    driver and clean it up somewhat (Lv Zheng).
 
  - Update the ACPI SPCR driver to extend the existing XGENE 8250
    workaround in it to a new platform (m400) and to work around
    an Xgene UART clock issue (Graeme Gregory).
 
  - Add a new utility function to the ACPI core to support using
    ACPI OEM ID / OEM Table ID / Revision for system identification
    in blacklisting or similar and switch over the existing code
    already using this information to this new interface (Toshi Kani).
 
  - Fix an xpower PMIC issue related to GPADC reads that always return
    0 without extra pin manipulations (Hans de Goede).
 
  - Add statements to print debug messages in a couple of places in
    the ACPI core for easier diagnostics (Rafael Wysocki).
 
  - Clean up the ACPI processor driver slightly (Colin Ian King,
    Hanjun Guo).
 
  - Clean up the ACPI x86 boot code somewhat (Andy Shevchenko).
 
  - Add a quirk for Dell OptiPlex 9020M to the ACPI backlight
    driver (Alex Hung).
 
  - Assorted fixes, cleanups and updates related to ACPI (Amitoj Kaur
    Chawla, Bhumika Goyal, Frank Rowand, Jean Delvare, Punit Agrawal,
    Ronald Tschalär, Sumeet Pawnikar).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZrcE+AAoJEILEb/54YlRxVGAP/RKzkJlYlOIXtMjf4XWg5ZfJ
 RKZA68E9DW179KoBoTCVPD6/eD5UoEJ7fsWXFU2Hgp2xL3N1mZMAJHgAE4GoAwCx
 uImoYvQgdPna7DawzRIFkvkfceYxNyh+KaV9s7xne4hAwsB7JzP9yf5Ywll53+oF
 Le27/r6lDOaWhG7uYcxSabnQsWZQkBF5mj2GPzEpKDIHcLA1Vii0URzm7mAHdZsz
 vGjYhxrshKYEVdkLSRn536m1rEfp2fqsRJ5wqNAazZJr6Cs1WIfNVuv/RfduRJpG
 /zHIRAmgKV+3jp39cBpjdnexLczb1rGiCV1yZOvwCNM7jy4evL8vbL7VgcUCopaj
 fHbF34chNG/hKJd3Zn3RRCTNzCs6bv+txslOMARxji5eyr2Q4KuVnvg5LM4hxOUP
 23FvcYkBYWu4QCNLOTnC7y2OqK6WzOvDpfi7hf13Z42iNzeAUbwt1sVF0/OCwL51
 Og6blSy2x8FidKp8oaBBboBzHEiKWnXBj/Hw8KEHVcsqZv1ZC6igNRAL3tjxamU8
 98/Z2NSZHYPrrrn13tT9ywISYXReXzUF85787+0ofugvDe8/QyBH6UhzzZc/xKVA
 t329JEjEFZZSLgxMIIa9bXoQANxkeZEGsxN6FfwvQhyIVdagLF3UvCjZl/q2NScC
 9n++s32qfUBRHetGODWc
 =6Ke9
 -----END PGP SIGNATURE-----

Merge tag 'acpi-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
 "These include a usual ACPICA code update (this time to upstream
  revision 20170728), a fix for a boot crash on some systems with
  Thunderbolt devices connected at boot time, a rework of the handling
  of PCI bridges when setting up device wakeup, new support for Apple
  device properties, support for DMA configurations reported via ACPI on
  ARM64, APEI-related updates, ACPI EC driver updates and assorted minor
  modifications in several places.

  Specifics:

   - Update the ACPICA code in the kernel to upstream revision 20170728
     including:
      * Alias operator handling update (Bob Moore).
      * Deferred resolution of reference package elements (Bob Moore).
      * Support for the _DMA method in walk resources (Bob Moore).
      * Tables handling update and support for deferred table
        verification (Lv Zheng).
      * Update of SMMU models for IORT (Robin Murphy).
      * Compiler and disassembler updates (Alex James, Erik Schmauss,
        Ganapatrao Kulkarni, James Morse).
      * Tools updates (Erik Schmauss, Lv Zheng).
      * Assorted minor fixes and cleanups (Bob Moore, Kees Cook, Lv
        Zheng, Shao Ming).

   - Rework the initialization of non-wakeup GPEs with method handlers
     in order to address a boot crash on some systems with Thunderbolt
     devices connected at boot time where we miss an early hotplug event
     due to a delay in GPE enabling (Rafael Wysocki).

   - Rework the handling of PCI bridges when setting up ACPI-based
     device wakeup in order to avoid disabling wakeup for bridges
     prematurely (Rafael Wysocki).

   - Consolidate Apple DMI checks throughout the tree, add support for
     Apple device properties to the device properties framework and use
     these properties for the handling of I2C and SPI devices on Apple
     systems (Lukas Wunner).

   - Add support for _DMA to the ACPI-based device properties lookup
     code and make it possible to use the information from there to
     configure DMA regions on ARM64 systems (Lorenzo Pieralisi).

   - Fix several issues in the APEI code, add support for exporting the
     BERT error region over sysfs and update APEI MAINTAINERS entry with
     reviewers information (Borislav Petkov, Dongjiu Geng, Loc Ho, Punit
     Agrawal, Tony Luck, Yazen Ghannam).

   - Fix a potential initialization ordering issue in the ACPI EC driver
     and clean it up somewhat (Lv Zheng).

   - Update the ACPI SPCR driver to extend the existing XGENE 8250
     workaround in it to a new platform (m400) and to work around an
     Xgene UART clock issue (Graeme Gregory).

   - Add a new utility function to the ACPI core to support using ACPI
     OEM ID / OEM Table ID / Revision for system identification in
     blacklisting or similar and switch over the existing code already
     using this information to this new interface (Toshi Kani).

   - Fix an xpower PMIC issue related to GPADC reads that always return
     0 without extra pin manipulations (Hans de Goede).

   - Add statements to print debug messages in a couple of places in the
     ACPI core for easier diagnostics (Rafael Wysocki).

   - Clean up the ACPI processor driver slightly (Colin Ian King, Hanjun
     Guo).

   - Clean up the ACPI x86 boot code somewhat (Andy Shevchenko).

   - Add a quirk for Dell OptiPlex 9020M to the ACPI backlight driver
     (Alex Hung).

   - Assorted fixes, cleanups and updates related to ACPI (Amitoj Kaur
     Chawla, Bhumika Goyal, Frank Rowand, Jean Delvare, Punit Agrawal,
     Ronald Tschalär, Sumeet Pawnikar)"

* tag 'acpi-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (75 commits)
  ACPI / APEI: Suppress message if HEST not present
  intel_pstate: convert to use acpi_match_platform_list()
  ACPI / blacklist: add acpi_match_platform_list()
  ACPI, APEI, EINJ: Subtract any matching Register Region from Trigger resources
  ACPI: make device_attribute const
  ACPI / sysfs: Extend ACPI sysfs to provide access to boot error region
  ACPI: APEI: fix the wrong iteration of generic error status block
  ACPI / processor: make function acpi_processor_check_duplicates() static
  ACPI / EC: Clean up EC GPE mask flag
  ACPI: EC: Fix possible issues related to EC initialization order
  ACPI / PM: Add debug statements to acpi_pm_notify_handler()
  ACPI: Add debug statements to acpi_global_event_handler()
  ACPI / scan: Enable GPEs before scanning the namespace
  ACPICA: Make it possible to enable runtime GPEs earlier
  ACPICA: Dispatch active GPEs at init time
  ACPI: SPCR: work around clock issue on xgene UART
  ACPI: SPCR: extend XGENE 8250 workaround to m400
  ACPI / LPSS: Don't abort ACPI scan on missing mem resource
  mailbox: pcc: Drop uninformative output during boot
  ACPI/IORT: Add IORT named component memory address limits
  ...
2017-09-05 12:45:03 -07:00
Rafael J. Wysocki ab271bc95b Merge branch 'intel_pstate'
* intel_pstate:
  cpufreq: intel_pstate: Shorten a couple of long names
  cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()
  cpufreq: intel_pstate: Improve IO performance with per-core P-states
  cpufreq: intel_pstate: Drop INTEL_PSTATE_HWP_SAMPLING_INTERVAL
  cpufreq: intel_pstate: Drop ->update_util from pstate_funcs
  cpufreq: intel_pstate: Do not use PID-based P-state selection
2017-09-04 00:05:42 +02:00
Rafael J. Wysocki 08a10002be Merge branch 'pm-cpufreq-sched'
* pm-cpufreq-sched:
  cpufreq: schedutil: Always process remote callback with slow switching
  cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
  cpufreq: Return 0 from ->fast_switch() on errors
  cpufreq: Simplify cpufreq_can_do_remote_dvfs()
  cpufreq: Process remote callbacks from any CPU if the platform permits
  sched: cpufreq: Allow remote cpufreq callbacks
  cpufreq: schedutil: Use unsigned int for iowait boost
  cpufreq: schedutil: Make iowait boost more energy efficient
2017-09-04 00:05:22 +02:00
Rafael J. Wysocki bd87c8fb9d Merge branch 'pm-cpufreq'
* pm-cpufreq: (33 commits)
  cpufreq: imx6q: Fix imx6sx low frequency support
  cpufreq: speedstep-lib: make several arrays static, makes code smaller
  cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
  cpufreq: dt-platdev: Drop few entries from whitelist
  cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
  ARM: ux500: don't select CPUFREQ_DT
  cpufreq: Convert to using %pOF instead of full_name
  cpufreq: Cap the default transition delay value to 10 ms
  cpufreq: dbx500: Delete obsolete driver
  mfd: db8500-prcmu: Get rid of cpufreq dependency
  cpufreq: enable the DT cpufreq driver on the Ux500
  cpufreq: Loongson2: constify platform_device_id
  cpufreq: dt: Add r8a7796 support to to use generic cpufreq driver
  cpufreq: remove setting of policy->cpu in policy->cpus during init
  cpufreq: mediatek: add support of cpufreq to MT7622 SoC
  cpufreq: mediatek: add cleanups with the more generic naming
  cpufreq: rcar: Add support for R8A7795 SoC
  cpufreq: dt: Add rk3328 compatible to use generic cpufreq driver
  cpufreq: s5pv210: add missing of_node_put()
  cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency
  ...
2017-09-04 00:05:13 +02:00
Toshi Kani 5e93232194 intel_pstate: convert to use acpi_match_platform_list()
Convert to use acpi_match_platform_list() for the platform check.
There is no change in functionality.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-29 01:42:49 +02:00
Rafael J. Wysocki 57ccaf3384 Merge back intel_pstate material for v4.14. 2017-08-21 01:50:20 +02:00
Sudeep Holla b20a3f3d8a cpufreq: remove setting of policy->cpu in policy->cpus during init
policy->cpu is copied into policy->cpus in cpufreq_online() before
calling into cpufreq_driver->init(). So there's no need to set the
same in the individual driver init() functions again.

This patch removes the redundant setting of policy->cpu in policy->cpus
in intel_pstate and cppc drivers.

Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-18 01:41:37 +02:00
Doug Smythies c587c79f90 cpufreq: intel_pstate: report correct CPU frequencies during trace
The intel_pstate CPU frequency scaling driver has always
calculated CPU frequency incorrectly.  Recent changes have
eliminted most of the issues, however the frequency reported
in the trace buffer, if used, is incorrect.

It remains desireable that cpu->pstate.scaling still be a nice
round number for things such as when setting max and min frequencies.
So the proposal is to just fix the reported frequency in the trace data.

Fixes what remains of [1].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96521 # [1]
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-11 01:25:53 +02:00
Rafael J. Wysocki d77d4888cb cpufreq: intel_pstate: Shorten a couple of long names
The names of the INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL symbol and
the get_target_pstate_use_cpu_load() function don't need to be so
long any more, so make them shorter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-10 01:09:16 +02:00
Rafael J. Wysocki a891283e56 cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()
Since there is only one P-state selection routine in intel_pstate
now, make intel_pstate_adjust_pstate() call it directly and drop
the target_pstate argument from that function.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-10 01:08:56 +02:00
Rafael J. Wysocki 3714c281c6 Merge v4.13 intel_pstate fixes. 2017-08-04 14:28:58 +02:00
Srinivas Pandruvada 7bde2d5001 cpufreq: intel_pstate: Improve IO performance with per-core P-states
In the current implementation, the response latency between seeing
SCHED_CPUFREQ_IOWAIT set and the actual P-state adjustment can be up
to 10ms.  It can be reduced by bumping up the P-state to the max at
the time SCHED_CPUFREQ_IOWAIT is passed to intel_pstate_update_util().
With this change, the IO performance improves significantly.

For a simple "grep -r . linux" (Here linux is the kernel source
folder) with caches dropped every time on a Broadwell Xeon workstation
with per-core P-states, the user and system time is shorter by as much
as 30% - 40%.

The same performance difference was not observed on clients that don't
support per-core P-state.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-04 13:58:57 +02:00
Rafael J. Wysocki 8a05c3115d Merge branches 'pm-cpufreq-x86', 'pm-cpufreq-docs' and 'intel_pstate'
* pm-cpufreq-x86:
  cpufreq: x86: Make scaling_cur_freq behave more as expected

* pm-cpufreq-docs:
  cpufreq: docs: Add missing cpuinfo_cur_freq description

* intel_pstate:
  cpufreq: intel_pstate: Drop ->get from intel_pstate structure
2017-08-03 20:29:24 +02:00
Viresh Kumar 674e75411f sched: cpufreq: Allow remote cpufreq callbacks
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.

One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.

This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.

The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.

The intel_pstate driver is updated to always reject remote callbacks.

This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.

The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
  OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
  next tick.

Based on initial work from Steve Muckle.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-01 14:24:53 +02:00
Rafael J. Wysocki f5c13f44c7 cpufreq: intel_pstate: Drop INTEL_PSTATE_HWP_SAMPLING_INTERVAL
After commit 62611cb912 (intel_pstate: delete scheduler hook in HWP
mode) the INTEL_PSTATE_HWP_SAMPLING_INTERVAL is not used anywhere in
the code, so drop it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-01 14:14:03 +02:00
Rafael J. Wysocki 22baebd489 cpufreq: intel_pstate: Drop ->get from intel_pstate structure
The ->get callback in the intel_pstate structure was mostly there
for the scaling_cur_freq sysfs attribute to work, but after commit
f8475cef90 (x86: use common aperfmperf_khz_on_cpu() to calculate
KHz using APERF/MPERF) that attribute uses arch_freq_get_on_cpu()
provided by the x86 arch code on all processors supported by
intel_pstate, so it doesn't need the ->get callback from the
driver any more.

Moreover, the very presence of the ->get callback in the intel_pstate
structure causes the cpuinfo_cur_freq attribute to be present when
intel_pstate operates in the active mode, which is bogus, because
the role of that attribute is to return the current CPU frequency
as seen by the hardware.  For intel_pstate, though, this is just an
average frequency and not really current, but computed for the
previous sampling interval (the actual current frequency may be
way different at the point this value is obtained by reading from
cpuinfo_cur_freq), and after commit 82b4e03e01 (intel_pstate: skip
scheduler hook when in "performance" mode) the value in
cpuinfo_cur_freq may be stale or just 0, depending on the driver's
operation mode.  In fact, however, on the hardware supported by
intel_pstate there is no way to read the current CPU frequency
from it, so the cpuinfo_cur_freq attribute should not be present
at all when this driver is in use.

For this reason, drop intel_pstate_get() and clear the ->get
callback pointer pointing to it, so that the cpuinfo_cur_freq is
not present for intel_pstate in the active mode any more.

Fixes: 82b4e03e01 (intel_pstate: skip scheduler hook when in "performance" mode)
Reported-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-07-27 23:51:58 +02:00
Rafael J. Wysocki c4f3f70cac cpufreq: intel_pstate: Drop ->update_util from pstate_funcs
All systems use the same P-state selection "powersave" algorithm
in the active mode if HWP is not used, so there's no need to provide
a pointer for it in struct pstate_funcs any more.

Drop ->update_util from struct pstate_funcs and make
intel_pstate_set_update_util_hook() use intel_pstate_update_util()
directly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-26 20:42:50 +02:00
Rafael J. Wysocki 9d0ef7af1f cpufreq: intel_pstate: Do not use PID-based P-state selection
All systems with a defined ACPI preferred profile that are not
"servers" have been using the load-based P-state selection algorithm
in intel_pstate since 4.12-rc1 (mobile systems and laptops have been
using it since 4.10-rc1) and no problems with it have been reported
to date.  In particular, no regressions with respect to the PID-based
P-state selection have been reported.  Also testing indicates that
the P-state selection algorithm based on CPU load is generally on par
with the PID-based algorithm performance-wise, and for some workloads
it turns out to be better than the other one, while being more
straightforward and easier to understand at the same time.

Moreover, the PID-based P-state selection algorithm in intel_pstate
is known to be unstable in some situation and generally problematic,
the issues with it are hard to address and it has become a
significant maintenance burden.

For these reasons, make intel_pstate use the "powersave" P-state
selection algorithm based on CPU load in the active mode on all
systems and drop the PID-based P-state selection code along with
all things related to it from the driver.  Also update the
documentation accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-26 20:42:50 +02:00
Viresh Kumar b8b78825a2 cpufreq: Don't set transition_latency for setpolicy drivers
The transition_latency field isn't used for drivers with ->setpolicy()
callback present and there is no point setting it from the drivers.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-26 00:15:43 +02:00
Rafael J. Wysocki ffa64d5e0d Merge branches 'intel_pstate' and 'pm-domains'
* intel_pstate:
  cpufreq: intel_pstate: Correct the busy calculation for KNL

* pm-domains:
  PM / Domains: defer dev_pm_domain_set() until genpd->attach_dev succeeds if present
2017-07-20 18:57:15 +02:00
Rafael J. Wysocki a252c258dd Merge branches 'pm-cpufreq-sched' and 'intel_pstate'
* pm-cpufreq-sched:
  cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race

* intel_pstate:
  cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
2017-07-14 13:16:16 +02:00
Srinivas Pandruvada 6e34e1f23d cpufreq: intel_pstate: Correct the busy calculation for KNL
The busy percent calculated for the Knights Landing (KNL) platform
is 1024 times smaller than the correct busy value.  This causes
performance to get stuck at the lowest ratio.

The scaling algorithm used for KNL is performance-based, but it still
looks at the CPU load to set the scaled busy factor to 0 when the
load is less than 1 percent.  In this case, since the computed load
is 1024x smaller than it should be, the scaled busy factor will
always be 0, irrespective of CPU business.

This needs a fix similar to the turbostat one in commit b2b34dfe4d
(tools/power turbostat: KNL workaround for %Busy and Avg_MHz).

For this reason, add one more callback to processor-specific
callbacks to specify an MPERF multiplier represented by a number of
bit positions to shift the value of that register to the left to
copmensate for its rate difference with respect to the TSC.  This
shift value is used during CPU busy calculations.

Fixes: ffb810563c (intel_pstate: Avoid getting stuck in high P-states when idle)
Reported-and-tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-14 03:01:28 +02:00
Srinivas Pandruvada d4436c0dba cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
When the minimum performance limit percentage is set to the power-up
default, it is possible that minimum performance ratio is off by one.

In the set_policy() callback the minimum ratio is calculated by
applying global.min_perf_pct to turbo_ratio and rounding up, but the
power-up default global.min_perf_pct is already rounded up to the
next percent in min_perf_pct_min().  That results in two round up
operations, so for the default min_perf_pct one of them is not
required.

It is better to remove rounding up in min_perf_pct_min() as this
matches the displayed min_perf_pct prior to commit c5a2ee7dde
(cpufreq: intel_pstate: Active mode P-state limits rework) in 4.12.

For example on a platform with max turbo ratio of 37 and minimum
ratio of 10, the min_perf_pct resulted in 28 with the above commit.
Before this commit it was 27 and it will be the same after this
change.

Fixes: 1a4fe38add (cpufreq: intel_pstate: Remove max/min fractions to limit performance)
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-12 14:39:10 +02:00
Rafael J. Wysocki 15d56b3921 Merge branches 'pm-domains', 'pm-sleep' and 'pm-cpufreq'
* pm-domains:
  PM / Domains: provide pm_genpd_poweroff_noirq() stub
  Revert "PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device"

* pm-sleep:
  PM / sleep: constify attribute_group structures

* pm-cpufreq:
  cpufreq: intel_pstate: constify attribute_group structures
  cpufreq: cpufreq_stats: constify attribute_group structures
2017-07-10 22:45:16 +02:00
Arvind Yadav 106c9c77ed cpufreq: intel_pstate: constify attribute_group structures
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
  15197	   2552	     40	  17789	   457d	drivers/cpufreq/intel_pstate.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
  15261	   2488	     40	  17789	   457d	drivers/cpufreq/intel_pstate.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-04 22:03:13 +02:00
Rafael J. Wysocki f1c7842e5f Merge branches 'pm-cpufreq', 'intel_pstate' and 'pm-cpuidle'
* pm-cpufreq:
  cpufreq / CPPC: Initialize policy->min to lowest nonlinear performance
  cpufreq: sfi: make freq_table static
  cpufreq: exynos5440: Fix inconsistent indenting
  cpufreq: imx6q: imx6ull should use the same flow as imx6ul
  cpufreq: dt: Add support for hi3660

* intel_pstate:
  cpufreq: Update scaling_cur_freq documentation
  cpufreq: intel_pstate: Clean up after performance governor changes
  intel_pstate: skip scheduler hook when in "performance" mode
  intel_pstate: delete scheduler hook in HWP mode
  x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
  cpufreq: intel_pstate: Remove max/min fractions to limit performance
  x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"

* pm-cpuidle:
  cpuidle: menu: allow state 0 to be disabled
  intel_idle: Use more common logging style
  x86/ACPI/cstate: Allow ACPI C1 FFH MWAIT use on AMD systems
  ARM: cpuidle: Support asymmetric idle definition
2017-07-03 14:21:18 +02:00
Rafael J. Wysocki 16b5b09240 Merge branch 'pm-tools'
* pm-tools:
  cpupower: Add support for new AMD family 0x17
  cpupower: Fix bug where return value was not used
  tools/power turbostat: update version number
  tools/power turbostat: decode MSR_IA32_MISC_ENABLE only on Intel
  tools/power turbostat: stop migrating, unless '-m'
  tools/power turbostat: if  --debug, print sampling overhead
  tools/power turbostat: hide SKL counters, when not requested
  intel_pstate: use updated msr-index.h HWP.EPP values
  tools/power x86_energy_perf_policy: support HWP.EPP
  x86: msr-index.h: fix shifts to ULL results in HWP macros.
  x86: msr-index.h: define HWP.EPP values
  x86: msr-index.h: define EPB mid-points
2017-07-03 14:17:16 +02:00
Rafael J. Wysocki fab24dcc39 cpufreq: intel_pstate: Clean up after performance governor changes
After commit 82b4e03e01 (intel_pstate: skip scheduler hook when in
"performance" mode) get_target_pstate_use_performance() and
get_target_pstate_use_cpu_load() are never called if scaling_governor
is "performance", so drop the CPUFREQ_POLICY_PERFORMANCE checks from
them as they will never trigger anyway.

Moreover, the documentation needs to be updated to reflect the change
made by the above commit, so do that too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
2017-06-29 23:25:15 +02:00
Len Brown 82b4e03e01 intel_pstate: skip scheduler hook when in "performance" mode
When the governor is set to "performance", intel_pstate does not
need the scheduler hook for doing any calculations.  Under these
conditions, its only purpose is to continue to maintain
cpufreq/scaling_cur_freq.

The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.

So in "performance" governor mode, the scheduler hook can be skipped.
This applies to both in Software and Hardware P-state control modes.

Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-27 01:47:34 +02:00
Len Brown 62611cb912 intel_pstate: delete scheduler hook in HWP mode
The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.

In HWP mode, maintaining that value was the sole purpose of
the scheduler hook, intel_pstate_update_util_hwp(),
so it can now be removed.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-27 01:47:33 +02:00
Srinivas Pandruvada 1a4fe38add cpufreq: intel_pstate: Remove max/min fractions to limit performance
In the current model the max/min perf limits are a fraction of current
user space limits to the allowed max_freq or 100% for global limits.
This results in wrong ratio limits calculation because of rounding
issues for some user space limits.

Initially we tried to solve this issue by issue by having more shift
bits to increase precision. Still there are isolated cases where we still
have error.

This can be avoided by using ratios all together. Since the way we get
cpuinfo.max_freq is by multiplying scaling factor to max ratio, we can
easily keep the max/min ratios in terms of ratios and not fractions.

For example:
if the max ratio = 36
cpuinfo.max_freq = 36 * 100000 = 3600000

Suppose user space sets a limit of 1200000, then we can calculate
max ratio limit as
= 36 * 1200000 / 3600000
= 12
This will be correct for any user limits.

The other advantage is that, we don't need to do any calculation in the
fast path as ratio limit is already calculated via set_policy() callback.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-24 01:48:12 +02:00
Rafael J. Wysocki 57caf4ec2b cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
Commit c5a2ee7dde (cpufreq: intel_pstate: Active mode P-state
limits rework) incorrectly assumed that pstate.turbo_pstate would
always be nonzero for CPU0 in min_perf_pct_min() if
cpufreq_register_driver() had succeeded which may not be the case
in virtualized environments.

If that assumption doesn't hold, it leads to an early crash on boot
in intel_pstate_register_driver(), so add a sanity check to
min_perf_pct_min() to prevent the crash from happening.

Fixes: c5a2ee7dde (cpufreq: intel_pstate: Active mode P-state limits rework)
Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-05 14:51:18 +02:00
Rafael J. Wysocki a32f80b30d Merge branch 'utilities' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull power management utilities updates from Len Brown.

* 'utilities' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  intel_pstate: use updated msr-index.h HWP.EPP values
  tools/power x86_energy_perf_policy: support HWP.EPP
  x86: msr-index.h: fix shifts to ULL results in HWP macros.
  x86: msr-index.h: define HWP.EPP values
  x86: msr-index.h: define EPB mid-points
2017-05-16 03:15:27 +02:00
Len Brown 3cedbc5a6d intel_pstate: use updated msr-index.h HWP.EPP values
intel_pstate exports sysfs attributes for setting and observing HWP.EPP.
These attributes use strings to describe 4 operating states, and
inside the driver, these strings are mapped to numerical register
values.

The authorative mapping between the strings and numerical HWP.EPP values
are now globally defined in msr-index.h, replacing the out-dated
mapping that were open-coded into intel_pstate.c

new old string
--- --- ------
  0   0 performance
128  64 balance_performance
192 128 balance_power
255 192 power

Note that the HW and BIOS default value on most system is 128,
which intel_pstate will now call "balance_performance"
while it used to call it "balance_power".

Signed-off-by: Len Brown <len.brown@intel.com>
2017-05-11 21:27:53 -04:00
Rafael J. Wysocki 1b72e7fd30 cpufreq: schedutil: Use policy-dependent transition delays
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).

That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.

Make intel_pstate set transition_delay_us to 500.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-04-17 18:37:27 +02:00
Box, David E 630e57573e cpufreq: intel_pstate: Add support for Gemini Lake
Use same parameters as INTEL_FAM6_ATOM_GOLDMONT to enable
Gemini Lake.

Signed-off-by: Box, David E <david.e.box@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-29 22:45:32 +02:00
Rafael J. Wysocki b02aabe8ab cpufreq: intel_pstate: Eliminate intel_pstate_get_min_max()
Some computations in intel_pstate_get_min_max() are not necessary
and one of its two callers doesn't even use the full result.

First off, the fixed-point value of cpu->max_perf represents a
non-negative number between 0 and 1 inclusive and cpu->min_perf
cannot be greater than cpu->max_perf.  It is not necessary to check
those conditions every time the numbers in question are used.

Moreover, since intel_pstate_max_within_limits() only needs the
upper boundary, it doesn't make sense to compute the lower one in
there and returning min and max from intel_pstate_get_min_max()
via pointers doesn't look particularly nice.

For the above reasons, drop intel_pstate_get_min_max(), add a helper
to get the base P-state for min/max computations and carry out them
directly in the previous callers of intel_pstate_get_min_max().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28 23:12:17 +02:00