ip_vs_lblc_table and ip_vs_lblcr_table, and the code that uses them,
are unnecessary when CONFIG_SYSCTL is undefined.
Signed-off-by: Simon Horman <horms@verge.net.au>
Much of ip_vs_leave() is unnecessary if CONFIG_SYSCTL is undefined.
I tried an approach of breaking the now #ifdef'ed portions out
into a separate function. However, this appeared to grow the
compiled code on x86_64 by about 200 bytes in the case where
CONFIG_SYSCTL is defined. So I have gone with the simpler though
less elegant #ifdef'ed solution for now.
Signed-off-by: Simon Horman <horms@verge.net.au>
ip_vs_conntrack_enabled() becomes a noop when CONFIG_SYSCTL is undefined.
In preparation for not including sysctl_conntrack in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
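A minimal sketch of the resulting pattern, assuming the accessor keeps
its current form in include/net/ip_vs.h (the final code may differ):

static inline int ip_vs_conntrack_enabled(struct netns_ipvs *ipvs)
{
#ifdef CONFIG_SYSCTL
        return ipvs->sysctl_conntrack;
#else
        /* no sysctl: conntrack support is simply reported as off */
        return 0;
#endif
}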
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_lblc{r}_expiration in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_expire_quiescent_template in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_expire_nodest_conn in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_sync_ver in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_sync_threshold in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_nat_icmp_send in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
In preparation for not including sysctl_snat_reroute in
struct netns_ipvs when CONFIG_SYSCTL is not defined.
Signed-off-by: Simon Horman <horms@verge.net.au>
Rename ip_vs_new_estimator to ip_vs_start_estimator
and ip_vs_kill_estimator to ip_vs_stop_estimator to better
match their logic.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Move the estimator reading from estimation_timer to user
context. ip_vs_read_estimator() will be used to decode the rate
values. As the decoded rates are not set by the estimation timer,
there is no need to reset them in ip_vs_zero_stats.
There is no need for ip_vs_new_estimator() to encode stats
to rates; if the destination is in the trash, both the stats and
the rates are inactive.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Remove ustats_seq, IPVS_STAT_INC and IPVS_STAT_ADD
because they are not used. They were replaced with u64_stats.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Currently, the new percpu counters are not zeroed and
the zero commands do not work as expected: we still show the old
sum of percpu values. OTOH, we cannot reset the percpu counters
from user context without causing the incrementing to use old
and bogus values.
So, as Eric Dumazet suggested fix that by moving all overhead
to stats reading in user context. Do not introduce overhead in
timer context (estimator) and incrementing (packet handling in
softirqs).
The new ustats0 field holds the zero point for all
counter values; the rates always use 0 as the base value, as before.
When showing the values to user space, just give the difference
between the counters and the base values. The only drawback is that
the percpu stats are not zeroed; they are accessible only from /proc
and are a new interface, so this should not be a compatibility problem
as long as the sum stats are correct after zeroing.
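A rough sketch of the idea; the helper names are made up and the field
names (ustats, ustats0, lock) follow the description above rather than
the exact patch:

/* Reading for user space: report counters relative to the zero point. */
static void ip_vs_export_stats_user(struct ip_vs_stats_user *dst,
                                    struct ip_vs_stats *s)
{
        spin_lock_bh(&s->lock);
        dst->conns    = s->ustats.conns    - s->ustats0.conns;
        dst->inpkts   = s->ustats.inpkts   - s->ustats0.inpkts;
        dst->outpkts  = s->ustats.outpkts  - s->ustats0.outpkts;
        dst->inbytes  = s->ustats.inbytes  - s->ustats0.inbytes;
        dst->outbytes = s->ustats.outbytes - s->ustats0.outbytes;
        spin_unlock_bh(&s->lock);
}

/* "Zeroing" only moves the base values; nothing is touched in timer or
 * packet context. */
static void ip_vs_zero_stats_user(struct ip_vs_stats *s)
{
        spin_lock_bh(&s->lock);
        s->ustats0 = s->ustats;
        spin_unlock_bh(&s->lock);
}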
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
The global tot_stats contains a cpustats field just like the
stats for dest and svc, so better use it to simplify the usage
in estimation_timer. As tot_stats is registered as an estimator,
we can remove the special ip_vs_read_cpu_stats call for
tot_stats. Fix ip_vs_read_cpu_stats to be called under the
stats lock, because it is still used as synchronization between
the estimation timer and user context (the stats readers).
Also, make sure ip_vs_stats_percpu_show reads the u64 stats
properly from user context.
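For the user-context read, something along these lines; the helper name
is made up, and the struct/field names (ip_vs_cpu_stats, ustats, syncp)
follow the existing per-cpu stats layout only as far as it is described
here:

static void ip_vs_read_percpu_u64(struct ip_vs_cpu_stats *c,
                                  u64 *inbytes, u64 *outbytes)
{
        unsigned int start;

        do {
                start = u64_stats_fetch_begin(&c->syncp);
                *inbytes  = c->ustats.inbytes;
                *outbytes = c->ustats.outbytes;
        } while (u64_stats_fetch_retry(&c->syncp, start));
}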
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Remove include/net/netns/ip_vs.h because it depends on
structures from include/net/ip_vs.h. As ipvs is a pointer in
struct net, it is better to move struct netns_ipvs into
include/net/ip_vs.h, so that we can easily use other structures
in struct netns_ipvs.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
There's no sense in 'ct = ct =' in ip_vs_notrack(). Just assign
nf_ct_get()'s return value directly to the pointer variable 'ct' once.
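As a before/after illustration (skb and ctinfo come from the
surrounding function):

/* before: redundant double assignment */
struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);

/* after: a single, direct assignment */
struct nf_conn *ct = nf_ct_get(skb, &ctinfo);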
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
The semantic patch that makes this output is available
in scripts/coccinelle/api/memdup.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
ip_vs_read_cpu_stats is called only from the timer, so
there is no need for _bh locks.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Restore the previous behaviour of looking up a fwmark
service only when fwmark is non-zero. This only saves some CPU cycles.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
This adds defines for the app selector values currently
defined in the IEEE 802.1Qaz specification.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a few spelling errors in dcbnl.h.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the lower device has offloading capabilities, the packet checksums
are not computed. That causes any macvlan port in bridge mode to stop
working, because the packets are dropped due to a bad checksum.
If the macvlan is in bridge mode, the packet is forwarded to another
macvlan port and reaches the network stack, which looks for a checksum
that was never computed because of the lower device's offloading.
In this case, we have to set CHECKSUM_UNNECESSARY on the packet
when it is forwarded to a bridged port and restore the previous value of
ip_summed when the packet goes to the lowerdev.
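A sketch of the two sides of the fix; the helper names are made up for
illustration (the real change lives in the macvlan xmit/forward paths):

/* Frame is looped to another macvlan port: the lower device's offload
 * never filled in the checksum, so tell the stack not to verify it. */
static void macvlan_csum_for_bridge(struct sk_buff *skb, int *saved)
{
        *saved = skb->ip_summed;
        skb->ip_summed = CHECKSUM_UNNECESSARY;
}

/* Frame goes out through the lower device: restore the original state. */
static void macvlan_csum_for_lowerdev(struct sk_buff *skb, int saved)
{
        skb->ip_summed = saved;
}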
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Andrian Nord <nightnord@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check for bonding master and refuse to use that.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Robert Love <robert.w.love@intel.com>
Acked-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
QQ2440 is just another non-ISA board using CS89x0. This patch adds the
minimum bits required to make QQ2440 work with CS89x0.
Signed-off-by: Domenico Andreoli <cavokz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CS89x0_NONISA_IRQ is selected by all those non-ISA boards which use
CS89x0. This patch only cleans up the last bits left after its introduction.
Signed-off-by: Domenico Andreoli <cavokz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hibernate image size autotuning mechanism sets the default
image size to 2/5 of the total system RAM, but it is reported
that on some systems device drivers allocate substantial
amounts of memory during suspend and the creation of the image
fails as a result (too little memory is preallocated).
Modify the autotuning mechanism to use 1/3 instead of 2/5 of RAM
as the default image size, which is reported to be sufficient for
the affected systems.
References: https://bugzilla.kernel.org/show_bug.cgi?id=30482
Reported-and-tested-by: Martin Steigerwald <Martin@Lichtvoll.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Some subsystems need to carry out suspend/resume and shutdown
operations with one CPU on-line and interrupts disabled. The only
way to register such operations is to define a sysdev class and
a sysdev specifically for this purpose, which is cumbersome and
inefficient. Moreover, the arguments taken by sysdev suspend,
resume and shutdown callbacks are practically never necessary.
For this reason, introduce a simpler interface allowing subsystems
to register operations to be executed very late during system suspend
and shutdown and very early during resume in the form of
struct syscore_ops objects.
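A usage sketch of the new interface (the "foo" names are placeholders):

static int foo_syscore_suspend(void)
{
        /* runs late, with one CPU online and interrupts disabled */
        return 0;
}

static void foo_syscore_resume(void)
{
        /* runs very early during resume, under the same conditions */
}

static struct syscore_ops foo_syscore_ops = {
        .suspend = foo_syscore_suspend,
        .resume  = foo_syscore_resume,
};

static int __init foo_init(void)
{
        register_syscore_ops(&foo_syscore_ops);
        return 0;
}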
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
I have a machine where entering deep C-states broke.
pm_qos was a hot candidate, but I couldn't find any way to
double-check without recompiling.
While in this case it was a driver bug (ath9k):
https://bugzilla.kernel.org/show_bug.cgi?id=27532
powertop or other tools may want to read out cpu_dma_latency
restrictions, which could be what prevents a machine from
entering deeper C-states.
Output with this patch:
# default value of 2000 * USEC_PER_SEC (0x77359400)
cat /dev/network_latency |hexdump
0000000 9400 7735
0000004
# value of 55 us which is the reason for not entering C2
cat /dev/cpu_dma_latency |hexdump
0000000 0037 0000
0000004
There is no reason to hide this info -> make pm_qos files readable.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
opp_find_freq_exact() documentation has is_available instead
of available. This also fixes the warnings from kernel-doc:
scripts/kernel-doc drivers/base/power/opp.c >/dev/null
Warning(drivers/base/power/opp.c:246): No description found for parameter 'available'
Warning(drivers/base/power/opp.c:246): Excess function parameter 'is_available' description in 'opp_find_freq_exact'
Signed-off-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The code handling system-wide power transitions (e.g. suspend-to-RAM)
can in theory execute callbacks provided by the device's bus type,
device type and class in each phase of the power transition. In
turn, the runtime PM core code only calls one of those callbacks at
a time, preferring bus type callbacks to device type or class
callbacks and device type callbacks to class callbacks.
It seems reasonable to make them both behave in the same way in that
respect. Moreover, even though a device may belong to two subsystems
(e.g. bus type and device class) simultaneously, in practice power
management callbacks for system-wide power transitions are always
provided by only one of them (i.e. if the bus type callbacks are
defined, the device class ones are not and vice versa). Thus it is
possible to modify the code handling system-wide power transitions
so that it follows the core runtime PM code (i.e. treats the
subsystem callbacks as mutually exclusive).
On the other hand, the core runtime PM code will choose to execute,
for example, a runtime suspend callback provided by the device type
even if the bus type's struct dev_pm_ops object exists, but the
runtime_suspend pointer in it happens to be NULL. This is confusing,
because it may lead to the execution of callbacks from different
subsystems during different operations (e.g. the bus type suspend
callback may be executed during runtime suspend of the device, while
the device type callback will be executed during system suspend).
Make all of the power management code treat subsystem callbacks in
a consistent way, such that:
(1) If the device's type is defined (i.e. dev->type is not NULL)
and its pm pointer is not NULL, the callbacks from dev->type->pm
will be used.
(2) If dev->type is NULL or dev->type->pm is NULL, but the device's
class is defined (i.e. dev->class is not NULL) and its pm pointer
is not NULL, the callbacks from dev->class->pm will be used.
(3) If dev->type is NULL or dev->type->pm is NULL and dev->class is
NULL or dev->class->pm is NULL, the callbacks from dev->bus->pm
will be used provided that both dev->bus and dev->bus->pm are
not NULL.
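Roughly, and only as a sketch (the helper name is made up), the
selection described above amounts to:

static const struct dev_pm_ops *subsys_pm_ops(struct device *dev)
{
        if (dev->type && dev->type->pm)
                return dev->type->pm;
        if (dev->class && dev->class->pm)
                return dev->class->pm;
        if (dev->bus && dev->bus->pm)
                return dev->bus->pm;
        return NULL;
}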
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Kevin Hilman <khilman@ti.com>
Reasoning-sounds-sane-to: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
'n' defaults are pretty pointless and actually bogus when used with
prompt-less config options.
The "bool"/"default y" pair with no prompt can be expressed more
compactly using def_bool.
[rjw: Rebased on top of earlier patches modifying this file.]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The platform bus type is often used to handle Systems-on-a-Chip (SoC)
where all devices are represented by objects of type struct
platform_device. In those cases the same "platform" device driver
may be used with multiple different system configurations, but the
actions needed to put the devices it handles into a low-power state
and back into the full-power state may depend on the design of the
given SoC. The driver, however, cannot possibly include all the
information necessary for the power management of its device on all
the systems it is used with. Moreover, the device hierarchy in its
current form also is not suitable for representing this kind of
information.
The patch below attempts to address this problem by introducing
objects of type struct dev_power_domain that can be used for
representing power domains within a SoC. Every struct
dev_power_domain object provides a set of device power
management callbacks that can be used to perform what's needed for
device power management in addition to the operations carried out by
the device's driver and subsystem.
Namely, if a struct dev_power_domain object is pointed to by the
pwr_domain field in a struct device, the callbacks provided by its
ops member will be executed in addition to the corresponding
callbacks provided by the device's subsystem and driver during all
power transitions.
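Attaching a domain to a device would then look roughly like this; the
SoC-specific names and callbacks below are placeholders:

static int my_soc_pd_runtime_suspend(struct device *dev)
{
        /* SoC-specific power-off steps, in addition to the callbacks
         * of the device's subsystem and driver */
        return 0;
}

static int my_soc_pd_runtime_resume(struct device *dev)
{
        /* SoC-specific power-on steps */
        return 0;
}

static struct dev_power_domain my_soc_pd = {
        .ops = {
                .runtime_suspend = my_soc_pd_runtime_suspend,
                .runtime_resume  = my_soc_pd_runtime_resume,
        },
};

static int my_soc_register(struct platform_device *pdev)
{
        pdev->dev.pwr_domain = &my_soc_pd;
        return platform_device_register(pdev);
}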
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-and-acked-by: Kevin Hilman <khilman@ti.com>
The variable pm_flags is used to prevent APM from being enabled
along with ACPI, which would lead to problems. However, acpi_init()
is always called before apm_init() and after acpi_init() has
returned, it is known whether or not ACPI will be used. Namely, if
acpi_disabled is not set after acpi_init() has returned, this means
that ACPI is enabled. Thus, it is sufficient to check acpi_disabled
in apm_init() to prevent APM from being enabled in parallel with
ACPI.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>
The dpm_prepare() function increments the runtime PM reference
counters of all devices to prevent pm_runtime_suspend() from
executing subsystem-level callbacks. However, this was supposed to
guard against a specific race condition that cannot happen: the
power management workqueue is freezable, so pm_runtime_suspend()
can only be called synchronously during system suspend, and we can
rely on subsystems and device drivers to avoid doing that
unnecessarily.
Make dpm_prepare() drop the runtime PM reference to each device
after making sure that runtime resume is not pending for it.
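The sequence described above, as a sketch (the helper name is made up):

static void dpm_prepare_flush_runtime(struct device *dev)
{
        pm_runtime_get_noresume(dev);
        pm_runtime_barrier(dev);        /* no runtime resume left pending */
        pm_runtime_put_sync(dev);       /* drop the reference again */
}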
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Kevin Hilman <khilman@ti.com>
CONFIG_PM_SLEEP_ADVANCED_DEBUG is not used any more, so drop it.
Also, CONFIG_CAN_PM_TRACE need not depend on EXPERIMENTAL, so remove
that dependency.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
After redefining CONFIG_PM to depend on (CONFIG_PM_SLEEP ||
CONFIG_PM_RUNTIME) the CONFIG_PM_OPS option is redundant and can be
replaced with CONFIG_PM.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reorder configuration options in kernel/power/Kconfig so that
the options depended on are at the top of the list.
This patch doesn't introduce any functional changes.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
From the users' point of view CONFIG_PM is really only used for
making it possible to set CONFIG_SUSPEND, CONFIG_HIBERNATION,
CONFIG_PM_RUNTIME and (surprisingly enough) CONFIG_XEN_SAVE_RESTORE
(CONFIG_PM_OPP also depends on CONFIG_PM, but quite artificially).
However, both CONFIG_SUSPEND and CONFIG_HIBERNATION require platform
support (independent of CONFIG_PM) and it is not quite obvious that
CONFIG_PM has to be set for CONFIG_XEN_SAVE_RESTORE to be available.
Thus, from the users' point of view, it would be more logical to
automatically select CONFIG_PM if any of the above options depending
on it are set.
Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME),
which will cause it to be selected when any of CONFIG_SUSPEND,
CONFIG_HIBERNATION, CONFIG_PM_RUNTIME, CONFIG_XEN_SAVE_RESTORE is
set and will clarify its meaning.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If direct references to pm_flags are removed from drivers/acpi/bus.c,
CONFIG_ACPI will not need to depend on CONFIG_PM any more. Make that
happen.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>
Currently, wakeup sysfs attributes are created for all devices,
regardless of whether or not they are wakeup-capable. This is
excessive and complicates wakeup device identification from user
space (i.e. to identify wakeup-capable devices user space has to read
/sys/devices/.../power/wakeup for all devices and see if they are not
empty).
Fix this issue by not creating wakeup sysfs files for devices
that cannot wake up the system from sleep states (i.e. whose
power.can_wakeup flags are unset during registration) and by modifying
device_set_wakeup_capable() so that it adds (or removes) the relevant
sysfs attributes if a device's wakeup capability status is changed.
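A sketch of the modified helper; the sysfs add/remove helper names are
illustrative:

void device_set_wakeup_capable(struct device *dev, bool capable)
{
        if (!!dev->power.can_wakeup == !!capable)
                return;

        dev->power.can_wakeup = capable;
        if (device_is_registered(dev)) {
                if (capable)
                        wakeup_sysfs_add(dev);          /* illustrative */
                else
                        wakeup_sysfs_remove(dev);       /* illustrative */
        }
}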
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
A subsequent patch will modify device_set_wakeup_capable() in such
a way that it will call functions which may sleep and therefore it
shouldn't be called under spinlocks. In preparation for that, modify
usb_set_device_state() to avoid calling device_set_wakeup_capable()
under device_state_lock.
Tested-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
printk()s without a priority level default to KERN_WARNING. To reduce
noise at KERN_WARNING, this patch sets the priority level appropriately
for unleveled printk()s. This should be useful to folks that look at
dmesg warnings closely.
Changed these messages to pr_info().
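For illustration (the message text is a placeholder):

printk("example message\n");   /* no level given: logged at KERN_WARNING */
pr_info("example message\n");  /* explicit KERN_INFO */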
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Since pm_save_wakeup_count() has just been changed to clear
events_check_enabled unconditionally before checking if there are
any new wakeup events registered since the last read from
/sys/power/wakeup_count, the detection of wakeup events during
suspend may be disabled, after it's been enabled, by writing a
"wrong" value back to /sys/power/wakeup_count. For this reason,
it is not necessary to update events_check_enabled in
pm_get_wakeup_count() any more.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
According to Documentation/ABI/testing/sysfs-power, the
/sys/power/wakeup_count interface should only make the kernel react
to wakeup events during suspend if the last write to it has been
successful. However, if /sys/power/wakeup_count is written to two
times in a row, where the first write is successful and the second
is not, the kernel will still react to wakeup events during suspend
due to a bug in pm_save_wakeup_count().
Fix the bug by making pm_save_wakeup_count() clear
events_check_enabled unconditionally before checking if there are
any new wakeup events registered since the previous read from
/sys/power/wakeup_count.
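A sketch of the fixed flow; the lock and variable names here are meant
only as an illustration of the logic described above:

bool pm_save_wakeup_count(unsigned int count)
{
        bool ret = false;

        /* Clear first, so a failed write leaves the check disabled. */
        events_check_enabled = false;

        spin_lock_irq(&events_lock);
        if (count == event_count && !events_in_progress) {
                events_check_enabled = true;
                ret = true;
        }
        spin_unlock_irq(&events_lock);

        return ret;
}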
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The memory barrier in wakeup_source_deactivate() is supposed to
prevent the callers of pm_wakeup_pending() and pm_get_wakeup_count()
from seeing the new value of events_in_progress (0, in particular)
and the old value of event_count at the same time. However, if
wakeup_source_deactivate() is executed by CPU0 and, for instance,
pm_wakeup_pending() is executed by CPU1, where both processors can
reorder operations, the memory barrier in wakeup_source_deactivate()
doesn't affect CPU1 which can reorder reads. In that case CPU1 may
very well decide to fetch event_count before it's modified and
events_in_progress after it's been updated, so pm_wakeup_pending()
may fail to detect a wakeup event. This issue can be addressed by
using a single atomic variable to store both events_in_progress
and event_count, so that they can be updated together in a single
atomic operation.
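A sketch of the combined counter; the bit split and helper names are
illustrative:

/* High bits: registered wakeup events; low bits: events in progress. */
#define IN_PROGRESS_BITS        (sizeof(int) * 4)
#define MAX_IN_PROGRESS         ((1 << IN_PROGRESS_BITS) - 1)

static atomic_t combined_event_count = ATOMIC_INIT(0);

static void split_counters(unsigned int *cnt, unsigned int *inpr)
{
        unsigned int comb = atomic_read(&combined_event_count);

        *cnt  = comb >> IN_PROGRESS_BITS;
        *inpr = comb & MAX_IN_PROGRESS;
}

/* Finishing an event: increment the event count and decrement the
 * in-progress count in a single atomic operation. */
static void report_event_done(void)
{
        atomic_add(MAX_IN_PROGRESS, &combined_event_count);
}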
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>