linux/kernel
Jeff Mahoney 9ab020cf07 taskstats: use better ifdef for alignment
Commit 4be2c95d ("taskstats: pad taskstats netlink response for aligment
issues on ia64") added a null field to align the taskstats structure but
the discussion centered around ia64.  The issue exists on other platforms
with inefficient unaligned access and adding them piecemeal would be an
unmaintainable mess.

This patch uses Dave Miller's suggestion of using a combination of
CONFIG_64BIT && !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
whether alignment is needed.

Note that this will cause breakage on those platforms with applications
like iotop which had hard-coded offsets into the packet to access the
taskstats structure.

The message seen on systems without the alignment fixes looks like: kernel
unaligned access to 0xe000023879dca9bc, ip=0xa000000100133d10

The addresses may vary but resolve to locations inside __delayacct_add_tsk.

iotop makes what I'd call unreasonable assumptions about the contents of a
netlink genetlink packet containing generic attributes.  They're typed and
have headers that specify value lengths, so the client can (should)
identify and skip the ones the client doesn't understand.

The kernel, as of version 2.6.36, presented a packet like so:
+--------------------------------+
| genlmsghdr - 4 bytes           |
+--------------------------------+
| NLA header - 4 bytes           | /* Aggregate header */
+-+------------------------------+
| | NLA header - 4 bytes         | /* PID header */
| +------------------------------+
| | pid/tgid   - 4 bytes         |
| +------------------------------+
| | NLA header - 4 bytes         | /* stats header */
| + -----------------------------+ <- oops. aligned on 4 byte boundary
| | struct taskstats - 328 bytes |
+-+------------------------------+

The iotop code expects that the kernel will behave as it did then,
assuming that the packet format is set in stone.  The format is set in
stone, but the packet offsets are not.  There's nothing in the packet
format that guarantees that the packet will be sent in exactly the same
way.  The attribute contents are set (or versioned) and the aggregate
contents are set but they can be anywhere in the packet.

The issue here isn't that an unaligned structure gets passed to userspace,
it's that the NLA infrastructure has something of a weakness: The 4 byte
attribute header may force the payload to be unaligned.  The taskstats
structure is created at an unaligned location and then 64-bit values are
operated on inside the kernel, so the unaligned access warnings gets
spewed everywhere.

It's possible to use the unaligned access API to operate on the structure
in the kernel but it seems like a wasted effort to work around userspace
code that isn't following the packet format.  Any new additions would also
need the be worked around.  It's a maintenance nightmare.

The conclusion of the earlier discussion seemed to be "ok fine, if we have
to break it, don't break it on arches that don't have the problem." Dave
pointed out that the unaligned access problem doesn't only exist on ia64,
but also on other 64-bit arches that don't have efficient unaligned access
and it should be fixed there as well.  The committed version of the patch
and this addition keep with the conclusion of that discussion not to break
it unnecessarily, which the pid padding and the packet padding fixes did
do.  x86_64 and powerpc don't suffer this problem so they shouldn't suffer
the solution.  Other 64-bit architectures do and will, though.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: David S. Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Florian Mickler <florian@mickler.org>
Cc: Guillaume Chazarain <guichaz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:19 -08:00
..
debug kdb: fix crash when KDB_BASE_CMD_MAX is exceeded 2010-11-17 13:54:57 -06:00
gcov llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
irq sched: Constify function scope static struct sched_param usage 2011-01-07 15:55:45 +01:00
power Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 2011-01-10 08:14:53 -08:00
time Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
trace Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-11 11:02:13 -08:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
Makefile kernel: clean up USE_GENERIC_SMP_HELPERS 2011-01-13 08:03:08 -08:00
acct.c pass a struct path to vfs_statfs 2010-08-09 16:48:42 -04:00
async.c async: use workqueue for worker pool 2010-07-14 11:29:46 +02:00
audit.c audit: Use rcu for task lookup protection 2010-10-30 08:45:42 -04:00
audit.h audit: make functions static 2010-10-30 01:42:19 -04:00
audit_tree.c in untag_chunk() we need to do alloc_chunk() a bit earlier 2010-10-30 02:18:32 -04:00
audit_watch.c audit: make functions static 2010-10-30 01:42:19 -04:00
auditfilter.c Audit: add support to match lsm labels on user audit messages 2010-10-30 01:41:57 -04:00
auditsc.c audit mmap 2010-10-30 08:45:43 -04:00
backtracetest.c
bounds.c
capability.c
cgroup.c fs: dcache reduce branches in lookup path 2011-01-07 17:50:28 +11:00
cgroup_freezer.c cgroup_freezer: update_freezer_state() does incorrect state transitions 2010-10-27 18:03:08 -07:00
compat.c compat: Make compat_alloc_user_space() incorporate the access_ok() 2010-09-14 16:08:45 -07:00
configs.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
cpu.c Merge branches 'x86-alternatives-for-linus', 'x86-fpu-for-linus', 'x86-hwmon-for-linus', 'x86-paravirt-for-linus', 'core-locking-for-linus' and 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-06 11:11:50 -08:00
cpuset.c convert cgroup and cpuset 2010-10-29 04:17:06 -04:00
cred.c signals: move cred_guard_mutex from task_struct to signal_struct 2010-10-27 18:03:12 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c sys_personality: remove the bogus checks in sys_personality()->__set_personality() path 2010-08-09 20:45:05 -07:00
exit.c Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-11 11:02:13 -08:00
extable.c
fork.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
freezer.c Freezer: Fix a race during freezing of TASK_STOPPED tasks 2010-12-24 15:02:40 +01:00
futex.c futex: Add futex_q static initializer 2010-11-10 15:01:34 +01:00
futex_compat.c futex: Address compiler warnings in exit_robust_list 2010-11-10 13:27:50 +01:00
groups.c kernel/groups.c: fix integer overflow in groups_search 2010-09-09 18:57:24 -07:00
hrtimer.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
hung_task.c lockup detector: Fix grammar by adding a missing "to" in the comments 2010-08-17 09:11:52 +02:00
hw_breakpoint.c perf: Dynamic pmu types 2010-12-16 11:36:43 +01:00
irq_work.c irq_work: Use per cpu atomics instead of regular atomics 2010-12-18 15:54:48 +01:00
itimer.c
jump_label.c jump label: Make arch_jump_label_text_poke_early() optional 2010-10-29 12:56:13 -04:00
kallsyms.c Revert "kernel: make /proc/kallsyms mode 400 to reduce ease of attacking" 2010-11-19 11:54:40 -08:00
kexec.c use clear_page()/copy_page() in favor of memset()/memcpy() on whole pages 2010-10-26 16:52:13 -07:00
kfifo.c kfifo: fix scatterlist usage 2010-10-01 10:50:58 -07:00
kmod.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
kprobes.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
ksysfs.c sysfs: add struct file* to bin_attr callbacks 2010-05-21 09:37:31 -07:00
kthread.c sched: Constify function scope static struct sched_param usage 2011-01-07 15:55:45 +01:00
latencytop.c fs/proc/base.c, kernel/latencytop.c: convert sprintf_symbol() to %ps 2011-01-13 08:03:16 -08:00
lockdep.c lockdep: Check the depth of subclass 2010-10-18 18:44:26 +02:00
lockdep_internals.h
lockdep_proc.c locking, lockdep: Convert sprintf_symbol to %pS 2010-11-10 10:23:58 +01:00
lockdep_states.h
module.c module: Move RO/NX module protection to after ftrace module update 2010-12-23 09:56:00 -05:00
mutex-debug.c
mutex-debug.h
mutex.c mutexes, sched: Introduce arch_mutex_cpu_relax() 2010-11-26 15:05:34 +01:00
mutex.h
notifier.c
ns_cgroup.c cgroup: notify ns_cgroup deprecated 2010-10-27 18:03:09 -07:00
nsproxy.c
padata.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2010-08-04 15:23:14 -07:00
panic.c lib/bug.c: add oops end marker to WARN implementation 2010-08-11 08:59:22 -07:00
params.c param: locking for kernel parameters 2010-08-11 23:04:20 +09:30
perf_event.c perf_events: Add perf_event_time() 2011-01-07 15:08:51 +01:00
pid.c Add RCU check for find_task_by_vpid(). 2010-08-19 17:18:02 -07:00
pid_namespace.c
pm_qos_params.c PM / PM QoS: Fix reversed min and max 2010-11-15 22:45:22 +01:00
posix-cpu-timers.c posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call 2010-11-10 13:07:06 +01:00
posix-timers.c posix-timers: Annotate lock_timer() 2010-10-21 17:30:06 +02:00
printk.c printk: use RCU to prevent potential lock contention in kmsg_dump 2011-01-13 08:03:09 -08:00
profile.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
ptrace.c signals: move cred_guard_mutex from task_struct to signal_struct 2010-10-27 18:03:12 -07:00
range.c kernel/range.c: fix clean_sort_range() for the case of full array 2010-11-12 07:55:31 -08:00
rcupdate.c Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu 2010-10-07 09:43:11 +02:00
rcutiny.c rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCU 2010-11-29 22:01:55 -08:00
rcutiny_plugin.h rcu: Distinguish between boosting and boosted 2010-11-29 22:01:56 -08:00
rcutorture.c rcu: add priority-inversion testing to rcutorture 2010-10-07 10:41:08 -07:00
rcutree.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
rcutree.h rcu: limit rcu_node leaf-level fanout 2010-12-17 12:34:20 -08:00
rcutree_plugin.h rcu: increase synchronize_sched_expedited() batching 2010-12-17 12:34:08 -08:00
rcutree_trace.c rcu,cleanup: simplify the code when cpu is dying 2010-11-29 22:01:58 -08:00
relay.c Clean up relay_alloc_page_array() slightly by using vzalloc rather than vmalloc and memset 2010-11-05 08:21:34 -07:00
res_counter.c
resource.c resources: add arch hook for preventing allocation in reserved areas 2010-12-17 10:01:09 -08:00
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c rtmutex-tester: make it build without BKL 2010-10-19 11:29:56 +02:00
rtmutex.c
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c sched: Fix strncmp operation 2011-01-07 15:55:10 +01:00
sched_autogroup.c sched: Mark autogroup_init() __init 2011-01-07 15:54:36 +01:00
sched_autogroup.h sched: Add 'autogroup' scheduling feature: automated per session task groups 2010-11-30 16:03:35 +01:00
sched_clock.c sched: Add some clock info to sched_debug 2010-11-23 10:29:08 +01:00
sched_cpupri.c sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_cpupri.h sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_debug.c sched: Add 'autogroup' scheduling feature: automated per session task groups 2010-11-30 16:03:35 +01:00
sched_fair.c sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight 2010-12-19 16:36:30 +01:00
sched_features.h sched: Rewrite tg_shares_up) 2010-11-18 13:27:46 +01:00
sched_idletask.c
sched_rt.c sched: Implement on-demand (active) cfs_rq list 2010-11-18 13:27:47 +01:00
sched_stats.h sched_stat: Update sched_info_queue/dequeue() code comments 2010-10-24 13:29:01 +02:00
sched_stoptask.c sched: Fix cross-sched-class wakeup preemption 2010-11-11 14:37:23 +01:00
seccomp.c
semaphore.c
signal.c signals: annotate lock context change on ptrace_stop() 2010-10-27 18:03:12 -07:00
smp.c kernel: clean up USE_GENERIC_SMP_HELPERS 2011-01-13 08:03:08 -08:00
softirq.c kernel: clean up USE_GENERIC_SMP_HELPERS 2011-01-13 08:03:08 -08:00
spinlock.c
srcu.c rcu: Make synchronize_srcu_expedited() fast if running readers 2010-11-29 22:02:40 -08:00
stacktrace.c
stop_machine.c stop_machine: convert cpu notifier to return encapsulate errno value 2010-10-26 16:52:15 -07:00
sys.c kmsg_dump: add kmsg_dump() calls to the reboot, halt, poweroff and emergency_restart paths 2011-01-13 08:03:07 -08:00
sys_ni.c powerpc: define a compat_sys_recv cond_syscall 2010-09-23 17:03:55 +10:00
sysctl.c sysctl: remove obsolete comments 2011-01-13 08:03:18 -08:00
sysctl_binary.c x86, NMI: Add back unknown_nmi_panic and nmi_watchdog sysctls 2010-12-10 00:01:06 +01:00
sysctl_check.c sysctl: min/max bounds are optional 2010-10-15 14:42:24 -07:00
taskstats.c taskstats: use better ifdef for alignment 2011-01-13 08:03:19 -08:00
test_kprobes.c kprobes: Fix selftest to clear flags field for reusing probes 2010-10-14 08:55:27 +02:00
time.c time: Kill off CONFIG_GENERIC_TIME 2010-07-27 12:40:54 +02:00
timeconst.pl
timer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-06 10:42:43 -08:00
tracepoint.c jump_label: Use more consistent naming 2010-10-18 19:58:56 +02:00
tsacct.c taskstats: use real microsecond granularity for CPU times 2010-10-27 18:03:17 -07:00
uid16.c
up.c
user-return-notifier.c
user.c fix freeing user_struct in user cache 2010-12-29 11:31:38 -08:00
user_namespace.c user_ns: improve the user_ns on-the-slab packaging 2011-01-13 08:03:18 -08:00
utsname.c
utsname_sysctl.c
wait.c docbook: add more wait/wake/completion to device-drivers docbook 2010-10-26 17:32:41 -07:00
watchdog.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
workqueue.c workqueue: allow chained queueing during destruction 2010-12-20 19:32:04 +01:00
workqueue_sched.h workqueue: implement concurrency managed dynamic worker pool 2010-06-29 10:07:14 +02:00