Commit Graph

1791 Commits

Author SHA1 Message Date
Paul E. McKenney 98059b9861 rcu: Separately compile large rcu_segcblist functions
This commit creates a new kernel/rcu/rcu_segcblist.c file that
contains non-trivial segcblist functions.  Trivial functions
remain as static inline functions in kernel/rcu/rcu_segcblist.h

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2017-05-02 07:21:02 -07:00
Paul E. McKenney d160a727c4 srcu: Make SRCU be built by default
SRCU is optional, and included only if there is a "select SRCU" in effect.
However, we now have Tiny SRCU, so this commit defaults CONFIG_SRCU=y.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-24 08:36:02 -07:00
Viresh Kumar 1b3b3b49b9 init/main: properly align the multi-line comment
Add a tab before it to follow standard practices. Also add the missing
full stop '.'.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-04-24 13:14:21 +02:00
Viresh Kumar 6623f1c615 init/main: Fix double "the" in comment
s/the\ the/the

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-04-24 13:14:21 +02:00
Paul E. McKenney 677df9d461 srcu: Fix Kconfig botch when SRCU not selected
If the CONFIG_SRCU option is not selected, for example, when building
arch/tile allnoconfig, the following build errors appear:

	kernel/rcu/tree.o: In function `srcu_online_cpu':
	tree.c:(.text+0x4248): multiple definition of `srcu_online_cpu'
	kernel/rcu/srcutree.o:srcutree.c:(.text+0x2120): first defined here
	kernel/rcu/tree.o: In function `srcu_offline_cpu':
	tree.c:(.text+0x4250): multiple definition of `srcu_offline_cpu'
	kernel/rcu/srcutree.o:srcutree.c:(.text+0x2160): first defined here

The corresponding .config file shows CONFIG_TREE_SRCU=y, but no sign
of CONFIG_SRCU, which fatally confuses SRCU's #ifdefs, resulting in
the above errors.  The reason this occurs is the folowing line in
init/Kconfig's definition for TREE_SRCU:

	default y if !TINY_RCU && !CLASSIC_SRCU

If CONFIG_CLASSIC_SRCU=n, as it will be in for allnoconfig, and if
CONFIG_SMP=y, then we will get CONFIG_TREE_SRCU=y but no CONFIG_SRCU,
as seen in the .config file, and which will result in the above errors.
This error did not show up during rcutorture testing because rcutorture
forces CONFIG_SRCU=y, as it must to prevent build errors in rcutorture.c.

This commit therefore conditions TREE_SRCU (and TINY_SRCU, while it is
at it) with SRCU, like this:

	default y if SRCU && !TINY_RCU && !CLASSIC_SRCU

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20170423162205.GP3956@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-24 08:14:48 +02:00
Ingo Molnar 58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Paul E. McKenney f2094107ac Merge branches 'doc.2017.04.12a', 'fixes.2017.04.19a' and 'srcu.2017.04.21a' into HEAD
doc.2017.04.12a: Documentation updates
fixes.2017.04.19a: Miscellaneous fixes
srcu.2017.04.21a: Parallelize SRCU callback handling
2017-04-21 06:00:13 -07:00
Paul E. McKenney 0248288009 rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick
If you set RCU_FANOUT_LEAF too high, you can get lock contention
on the leaf rcu_node, and you should boot with the skew_tick kernel
parameter set in order to avoid this lock contention.  This commit
therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
this.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:17 -07:00
Paul E. McKenney dad81a2026 srcu: Introduce CLASSIC_SRCU Kconfig option
The TREE_SRCU rewrite is large and a bit on the non-simple side, so
this commit helps reduce risk by allowing the old v4.11 SRCU algorithm
to be selected using a new CLASSIC_SRCU Kconfig option that depends
on RCU_EXPERT.  The default is to use the new TREE_SRCU and TINY_SRCU
algorithms, in order to help get these the testing that they need.
However, if your users do not require the update-side scalability that
is to be provided by TREE_SRCU, select RCU_EXPERT and then CLASSIC_SRCU
to revert back to the old classic SRCU algorithm.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:23 -07:00
Paul E. McKenney d8be81735a srcu: Create a tiny SRCU
In response to automated complaints about modifications to SRCU
increasing its size, this commit creates a tiny SRCU that is
used in SMP=n && PREEMPT=n builds.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:22 -07:00
Nicolas Pitre 0c688614dc console: move console_init() out of tty_io.c
All the console driver handling code lives in printk.c.
Move console_init() there as well so console support can still be used
when the TTY code is configured out. No logical changes from this patch.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-18 18:01:52 +02:00
Steven Rostedt (VMware) b80f0f6c9e ftrace: Have init/main.c call ftrace directly to free init memory
Relying on free_reserved_area() to call ftrace to free init memory proved to
not be sufficient. The issue is that on x86, when debug_pagealloc is
enabled, the init memory is not freed, but simply set as not present. Since
ftrace was uninformed of this, starting function tracing still tries to
update pages that are not present according to the page tables, causing
ftrace to bug, as well as killing the kernel itself.

Instead of relying on free_reserved_area(), have init/main.c call ftrace
directly just before it frees the init memory. Then it needs to use
__init_begin and __init_end to know where the init memory location is.
Looking at all archs (and testing what I can), it appears that this should
work for each of them.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-03 14:04:00 -04:00
Michal Hocko 597b7305dd mm: move mm_percpu_wq initialization earlier
Yang Li has reported that drain_all_pages triggers a WARN_ON which means
that this function is called earlier than the mm_percpu_wq is
initialized on arm64 with CMA configured:

  WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
  Modules linked in:
  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
  Hardware name: Freescale Layerscape 2088A RDB Board (DT)
  task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
  PC is at drain_all_pages+0x244/0x25c
  LR is at start_isolate_page_range+0x14c/0x1f0
  [...]
   drain_all_pages+0x244/0x25c
   start_isolate_page_range+0x14c/0x1f0
   alloc_contig_range+0xec/0x354
   cma_alloc+0x100/0x1fc
   dma_alloc_from_contiguous+0x3c/0x44
   atomic_pool_init+0x7c/0x208
   arm64_dma_init+0x44/0x4c
   do_one_initcall+0x38/0x128
   kernel_init_freeable+0x1a0/0x240
   kernel_init+0x10/0xfc
   ret_from_fork+0x10/0x20

Fix this by moving the whole setup_vmstat which is an initcall right now
to init_mm_internals which will be called right after the WQ subsystem
is initialized.

Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Yang Li <pku.leo@gmail.com>
Tested-by: Yang Li <pku.leo@gmail.com>
Tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-31 17:13:30 -07:00
Steven Rostedt (VMware) f631718de3 ftrace: Move ftrace_init() to right after memory initialization
Initialize the ftrace records immediately after memory initialization, as
that is all that is required for the records to be created. This will allow
for future work to get function tracing started earlier in the boot process.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 13:08:44 -04:00
Steven Rostedt (VMware) e725c731e3 tracing: Split tracing initialization into two for early initialization
Create an early_trace_init() function that will initialize the buffers and
allow for ealier use of trace_printk(). This will also allow for future work
to have function tracing start earlier at boot up.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 13:08:43 -04:00
Linus Torvalds 84c37c168c Change get_random_{int,log} to use the CRNG used by /dev/urandom and
getrandom(2).  It's faster and arguably more secure than cut-down MD5
 that we had been using.
 
 Also do some code cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAljCENEACgkQ8vlZVpUN
 gaP8lwf7BFtF52mKQcsVYxxZtRPH5dQwJCh3rohQ0WEJi5hHyZPZNz24dPHgc8Xl
 GDq7v7o10dL3aeK6P51lYNcDb9xwYakCXm5sw46c5juca/VAVaxHb/kSDPSPUCNj
 7n7mNSM61UhYAN10AXi9FGJo/Rdr0U5F1VfoWVYqaHYsItYLCjlSk6ob7vKxCPUd
 458qaGBvK8luwQgFPQftJ20j81zXNuRe5JHjCQ2LtaRWM8kNI/wmyNSokD73BkZl
 k8B7VqG4YpKp+4xgThp12GpXHrKB9kzQfmM4dZQQiGai9Ni59+iNqEcumv0Jb5MG
 gY/m5Wc1Q45/5FosPXQYHzMPHrSJ3A==
 =g1OD
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random updates from Ted Ts'o:
 "Change get_random_{int,log} to use the CRNG used by /dev/urandom and
  getrandom(2). It's faster and arguably more secure than cut-down MD5
  that we had been using.

  Also do some code cleanup"

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  random: move random_min_urandom_seed into CONFIG_SYSCTL ifdef block
  random: convert get_random_int/long into get_random_u32/u64
  random: use chacha20 for get_random_int/long
  random: fix comment for unused random_min_urandom_seed
  random: remove variable limit
  random: remove stale urandom_init_wait
  random: remove stale maybe_reseed_primary_crng
2017-03-11 09:08:47 -08:00
Ingo Molnar 1777e46355 sched/headers: Prepare to move _init() prototypes from <linux/sched.h> to <linux/sched/init.h>
But first introduce a trivial header and update usage sites.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:40 +01:00
Ingo Molnar 5c2c5c5514 sched/headers, vfs/execve: Prepare to move the do_execve*() prototypes from <linux/sched.h> to <linux/binfmts.h>
But first update the usage sites with the new header dependency.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar 9164bb4a18 sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h>
Update all usage sites first.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar 68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar 299300258d sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task.h>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:35 +01:00
Ingo Molnar 38b8d208a4 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/nmi.h>
We are going to move softlockup APIs out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

<linux/nmi.h> already includes <linux/sched.h>.

Include the <linux/nmi.h> header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Linus Torvalds cf393195c3 Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax
Pull IDR rewrite from Matthew Wilcox:
 "The most significant part of the following is the patch to rewrite the
  IDR & IDA to be clients of the radix tree. But there's much more,
  including an enhancement of the IDA to be significantly more space
  efficient, an IDR & IDA test suite, some improvements to the IDR API
  (and driver changes to take advantage of those improvements), several
  improvements to the radix tree test suite and RCU annotations.

  The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
  for most of the last cycle. Coupled with the IDR test suite, I feel
  pretty confident that any remaining bugs are quite hard to hit. 0-day
  did a great job of watching my git tree and pointing out problems; as
  it hit them, I added new test-cases to be sure not to be caught the
  same way twice"

Willy goes on to expand a bit on the IDR rewrite rationale:
 "The radix tree and the IDR use very similar data structures.

  Merging the two codebases lets us share the memory allocation pools,
  and results in a net deletion of 500 lines of code. It also opens up
  the possibility of exposing more of the features of the radix tree to
  users of the IDR (and I have some interesting patches along those
  lines waiting for 4.12)

  It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
  will shrink a fair few data structures that embed an IDR"

* 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
  radix tree test suite: Add config option for map shift
  idr: Add missing __rcu annotations
  radix-tree: Fix __rcu annotations
  radix-tree: Add rcu_dereference and rcu_assign_pointer calls
  radix tree test suite: Run iteration tests for longer
  radix tree test suite: Fix split/join memory leaks
  radix tree test suite: Fix leaks in regression2.c
  radix tree test suite: Fix leaky tests
  radix tree test suite: Enable address sanitizer
  radix_tree_iter_resume: Fix out of bounds error
  radix-tree: Store a pointer to the root in each node
  radix-tree: Chain preallocated nodes through ->parent
  radix tree test suite: Dial down verbosity with -v
  radix tree test suite: Introduce kmalloc_verbose
  idr: Return the deleted entry from idr_remove
  radix tree test suite: Build separate binaries for some tests
  ida: Use exceptional entries for small IDAs
  ida: Move ida_bitmap to a percpu variable
  Reimplement IDR and IDA using the radix tree
  radix-tree: Add radix_tree_iter_delete
  ...
2017-02-28 20:29:41 -08:00
Linus Torvalds 86292b33d4 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - a few MM remainders

 - misc things

 - autofs updates

 - signals

 - affs updates

 - ipc

 - nilfs2

 - spelling.txt updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits)
  mm, x86: fix HIGHMEM64 && PARAVIRT build config for native_pud_clear()
  mm: add arch-independent testcases for RODATA
  hfs: atomically read inode size
  mm: clarify mm_struct.mm_{users,count} documentation
  mm: use mmget_not_zero() helper
  mm: add new mmget() helper
  mm: add new mmgrab() helper
  checkpatch: warn when formats use %Z and suggest %z
  lib/vsprintf.c: remove %Z support
  scripts/spelling.txt: add some typo-words
  scripts/spelling.txt: add "followings" pattern and fix typo instances
  scripts/spelling.txt: add "therfore" pattern and fix typo instances
  scripts/spelling.txt: add "overwriten" pattern and fix typo instances
  scripts/spelling.txt: add "overwritting" pattern and fix typo instances
  scripts/spelling.txt: add "deintialize(d)" pattern and fix typo instances
  scripts/spelling.txt: add "disassocation" pattern and fix typo instances
  scripts/spelling.txt: add "omited" pattern and fix typo instances
  scripts/spelling.txt: add "explictely" pattern and fix typo instances
  scripts/spelling.txt: add "applys" pattern and fix typo instances
  scripts/spelling.txt: add "configuartion" pattern and fix typo instances
  ...
2017-02-27 23:09:29 -08:00
Linus Torvalds f7878dc3a9 Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Several noteworthy changes.

   - Parav's rdma controller is finally merged. It is very straight
     forward and can limit the abosolute numbers of common rdma
     constructs used by different cgroups.

   - kernel/cgroup.c got too chubby and disorganized. Created
     kernel/cgroup/ subdirectory and moved all cgroup related files
     under kernel/ there and reorganized the core code. This hurts for
     backporting patches but was long overdue.

   - cgroup v2 process listing reimplemented so that it no longer
     depends on allocating a buffer large enough to cache the entire
     result to sort and uniq the output. v2 has always mangled the sort
     order to ensure that users don't depend on the sorted output, so
     this shouldn't surprise anybody. This makes the pid listing
     functions use the same iterators that are used internally, which
     have to have the same iterating capabilities anyway.

   - perf cgroup filtering now works automatically on cgroup v2. This
     patch was posted a long time ago but somehow fell through the
     cracks.

   - misc fixes asnd documentation updates"

* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
  kernfs: fix locking around kernfs_ops->release() callback
  cgroup: drop the matching uid requirement on migration for cgroup v2
  cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
  cgroup: misc cleanups
  cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
  cgroup: track migration context in cgroup_mgctx
  cgroup: cosmetic update to cgroup_taskset_add()
  rdmacg: Fixed uninitialized current resource usage
  cgroup: Add missing cgroup-v2 PID controller documentation.
  rdmacg: Added documentation for rdmacg
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added rdma cgroup controller
  cgroup: fix a comment typo
  cgroup: fix RCU related sparse warnings
  cgroup: move namespace code to kernel/cgroup/namespace.c
  cgroup: rename functions for consistency
  cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
  cgroup: separate out cgroup1_kf_syscall_ops
  cgroup: refactor mount path and clearly distinguish v1 and v2 paths
  cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
  ...
2017-02-27 21:41:08 -08:00
Jinbum Park 2959a5f726 mm: add arch-independent testcases for RODATA
This patch makes arch-independent testcases for RODATA.  Both x86 and
x86_64 already have testcases for RODATA, But they are arch-specific
because using inline assembly directly.

And cacheflush.h is not a suitable location for rodata-test related
things.  Since they were in cacheflush.h, If someone change the state of
CONFIG_DEBUG_RODATA_TEST, It cause overhead of kernel build.

To solve the above issues, write arch-independent testcases and move it
to shared location.

[jinb.park7@gmail.com: fix config dependency]
  Link: http://lkml.kernel.org/r/20170209131625.GA16954@pjb1027-Latitude-E5410
Link: http://lkml.kernel.org/r/20170129105436.GA9303@pjb1027-Latitude-E5410
Signed-off-by: Jinbum Park <jinb.park7@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Lokesh Vutla 0886551480 initramfs: finish fput() before accessing any binary from initramfs
Commit 4a9d4b024a ("switch fput to task_work_add") implements a
schedule_work() for completing fput(), but did not guarantee calling
__fput() after unpacking initramfs.  Because of this, there is a
possibility that during boot a driver can see ETXTBSY when it tries to
load a binary from initramfs as fput() is still pending on that binary.

This patch makes sure that fput() is completed after unpacking initramfs
and removes the call to flush_delayed_fput() in kernel_init() which
happens very late after unpacking initramfs.

Link: http://lkml.kernel.org/r/20170201140540.22051-1-lokeshvutla@ti.com
Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
Reported-by: Murali Karicheri <m-karicheri2@ti.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tero Kristo <t-kristo@ti.com>
Cc: Sekhar Nori <nsekhar@ti.com>
Cc: Nishanth Menon <nm@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Linus Torvalds bc49a7831b Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "142 patches:

   - DAX updates

   - various misc bits

   - OCFS2 updates

   - most of MM"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits)
  mm/z3fold.c: limit first_num to the actual range of possible buddy indexes
  mm: fix <linux/pagemap.h> stray kernel-doc notation
  zram: remove obsolete sysfs attrs
  mm/memblock.c: remove unnecessary log and clean up
  oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
  mm: drop unused argument of zap_page_range()
  mm: drop zap_details::check_swap_entries
  mm: drop zap_details::ignore_dirty
  mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled
  mm: help __GFP_NOFAIL allocations which do not trigger OOM killer
  mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  lib/show_mem.c: teach show_mem to work with the given nodemask
  arch, mm: remove arch specific show_mem
  mm, page_alloc: warn_alloc print nodemask
  mm, page_alloc: do not report all nodes in show_mem
  Revert "mm: bail out in shrink_inactive_list()"
  mm, vmscan: consider eligible zones in get_scan_count
  mm, vmscan: cleanup lru size claculations
  mm, vmscan: do not count freed pages as PGDEACTIVATE
  ...
2017-02-22 19:29:24 -08:00
Linus Torvalds 7d91de7443 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:

 - Add Petr Mladek, Sergey Senozhatsky as printk maintainers, and Steven
   Rostedt as the printk reviewer. This idea came up after the
   discussion about printk issues at Kernel Summit. It was formulated
   and discussed at lkml[1].

 - Extend a lock-less NMI per-cpu buffers idea to handle recursive
   printk() calls by Sergey Senozhatsky[2]. It is the first step in
   sanitizing printk as discussed at Kernel Summit.

   The change allows to see messages that would normally get ignored or
   would cause a deadlock.

   Also it allows to enable lockdep in printk(). This already paid off.
   The testing in linux-next helped to discover two old problems that
   were hidden before[3][4].

 - Remove unused parameter by Sergey Senozhatsky. Clean up after a past
   change.

[1] http://lkml.kernel.org/r/1481798878-31898-1-git-send-email-pmladek@suse.com
[2] http://lkml.kernel.org/r/20161227141611.940-1-sergey.senozhatsky@gmail.com
[3] http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
[4] http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  printk: drop call_console_drivers() unused param
  printk: convert the rest to printk-safe
  printk: remove zap_locks() function
  printk: use printk_safe buffers in printk
  printk: report lost messages in printk safe/nmi contexts
  printk: always use deferred printk when flush printk_safe lines
  printk: introduce per-cpu safe_print seq buffer
  printk: rename nmi.c and exported api
  printk: use vprintk_func in vprintk()
  MAINTAINERS: Add printk maintainers
2017-02-22 17:33:34 -08:00
Tejun Heo 1663f26df3 slub: make sysfs directories for memcg sub-caches optional
SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
bunch of debug files.  Usually, there aren't that many caches on a
system and this doesn't really matter; however, if memcg is in use, each
cache can have per-cgroup sub-caches.  SLUB creates the same directories
for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.

Unfortunately, because there can be a lot of cgroups, active or
draining, the product of the numbers of caches, cgroups and files in
each directory can reach a very high number - hundreds of thousands is
commonplace.  Millions and beyond aren't difficult to reach either.

What's under /sys/kernel/slab is primarily for debugging and the
information and control on the a root cache already cover its
sub-caches.  While having a separate directory for each sub-cache can be
helpful for development, it doesn't make much sense to pay this amount
of overhead by default.

This patch introduces a boot parameter slub_memcg_sysfs which determines
whether to create sysfs directories for per-memcg sub-caches.  It also
adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
default value and defaults to 0.

[akpm@linux-foundation.org: kset_unregister(NULL) is legal]
Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Linus Torvalds e30aee9e10 char/misc driver patches for 4.11-rc1
Here is the big char/misc driver patchset for 4.11-rc1.
 
 Lots of different driver subsystems updated here.  Rework for the hyperv
 subsystem to handle new platforms better, mei and w1 and extcon driver
 updates, as well as a number of other "minor" driver updates.  Full
 details are in the shortlog below.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWK2iRQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynhFACguVE+/ixj5u5bT5DXQaZNai/6zIAAmgMWwd/t
 YTD2cwsJsGbTT1fY3SUe
 =CiSI
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver patchset for 4.11-rc1.

  Lots of different driver subsystems updated here: rework for the
  hyperv subsystem to handle new platforms better, mei and w1 and extcon
  driver updates, as well as a number of other "minor" driver updates.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (169 commits)
  goldfish: Sanitize the broken interrupt handler
  x86/platform/goldfish: Prevent unconditional loading
  vmbus: replace modulus operation with subtraction
  vmbus: constify parameters where possible
  vmbus: expose hv_begin/end_read
  vmbus: remove conditional locking of vmbus_write
  vmbus: add direct isr callback mode
  vmbus: change to per channel tasklet
  vmbus: put related per-cpu variable together
  vmbus: callback is in softirq not workqueue
  binder: Add support for file-descriptor arrays
  binder: Add support for scatter-gather
  binder: Add extra size to allocator
  binder: Refactor binder_transact()
  binder: Support multiple /dev instances
  binder: Deal with contexts in debugfs
  binder: Support multiple context managers
  binder: Split flat_binder_object
  auxdisplay: ht16k33: remove private workqueue
  auxdisplay: ht16k33: rework input device initialization
  ...
2017-02-22 11:38:22 -08:00
Linus Torvalds 7bb033829e This renames the (now inaccurate) CONFIG_DEBUG_RODATA and related config
CONFIG_SET_MODULE_RONX to the more sensible CONFIG_STRICT_KERNEL_RWX and
 CONFIG_STRICT_MODULE_RWX.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJYrJ2ZAAoJEIly9N/cbcAmb4UQAIDnJYF4xecUfxofypQwt7ey
 DcR8SH+g/Rkm3v2bUOrVdlP333ePRUEs6C47PgYSLlKsZiQA3H6bsTILHJZGHZ3j
 laNH4sjQ0j+Sr2rHXk8fLz3YpHHwIy49bfu2ERXFH92BMnTMCv1h9IWFgOMH+4y5
 09n16TPHMUj1k0DGjHO/n03qLIKOo3Xy/Va5dhQ/6dGU4zR4KhOBnhLlG3IU7Atd
 KTR+ba/qym7bDQbTezMuaajTiZctr6a45yBKDWu4Knu+ot2a7K7fYvfRT3YVb5SU
 aTSYps7NKQbewcQSqNdek1zytoy2Ck7CH511e+3ypwNmao5KQwRgH7OX1pDEXyZv
 rGDaVzKMTSddH23jLEKUbpR847Lza9+V3h5YtbMG8GgiCKs91Ec666iEE3NVZBO8
 1hiiYhE2iDxi10B/EZZcn2gOt2JaB2m2GxWIrJOz4txtDAWbUYlhUpWEUynBTPQ0
 cYBZVnge81awipZJTWUv57LyufnTnMSK3i8Q8t0woj4C7pFbPYfjnKCrgwTQyAvr
 mD4uFBrgFb1lftbc3kfTdeoZmXerzvubsstWdr3rU3nsiJFzY1SwJZe8n0THyL4g
 DzURFrj/8UXb32Kavysz6FTxFO9u87mJm6yqHn/Y3bEK7Y7cch/NYjRC9Q6dpH+4
 ld9apHF6iRrqgf+x6oOh
 =7KhR
 -----END PGP SIGNATURE-----

Merge tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull rodata updates from Kees Cook:
 "This renames the (now inaccurate) DEBUG_RODATA and related
  SET_MODULE_RONX configs to the more sensible STRICT_KERNEL_RWX and
  STRICT_MODULE_RWX"

* tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
  arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
2017-02-21 17:56:45 -08:00
Linus Torvalds 6d1c42d9b9 Final extable.h related changes.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYrFn+AAoJEOvOhAQsB9HWbDAP/i1bMxYJSnwD2D/PBvzpY2AW
 uinEigaRr6kpRCOfa9FrfgxokKfosZOx5h7Se3f6O3mPwgpsU+dqbaE18Z5XSgxh
 +a9+HvAv3/XNZg7SvBtBaoYDblHWJ6AJ9rN9fuKg3e8btE3rSFG147vj1atlVz1+
 iRsXcCPb1p5db2+wZdsYJPI5Zwt4N0nR6cxPX4RQ6jseiVqPpt/FDtB60RYCjbID
 J0cOk1VV1Jn2H1Rfl+hjNQjIPMNx3zftOLQ2usr/kwuEqeuTKZR06yLXFOT6bdXU
 6JBdfL+e2kHKbaLyJGr6MCjTokaMgN3SGZJWJqHgk5Nggq5BD+2c4AOs8t6URnE0
 KThGiyY+YI5C/W6kMlEozLARiMKe4IIQpx1uj2Hv+YkndntvqjCqvfdQQJKnzm0G
 YWfPnsG2dysiovwEOBoBwyFVFLFzzJ1o3uyRGkCzVGaLQVzD5ktAJM6ynMOxwcIn
 zSN+agzdTAD7QJIDaa1p2r5fAqy7i4xIn2+ts1s9c410fdUTB4A2QJzTywjPAdCp
 IRxcLLpYDeBZ5cbhqjR677WgPtteYFTljoX+/8BOFO2PI+HjKHrxfW02WSiCS0iu
 CUndrlpmuyKlIrpw7mYpDTbORcSQSiUkB1pRGT7poh2p0KKAGSo9ZrRbx+qRvPdH
 AxO+ZR6Jjj5LAMk5MkRz
 =RK2h
 -----END PGP SIGNATURE-----

Merge tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

Pull exception table module split from Paul Gortmaker:
 "Final extable.h related changes.

  This completes the separation of exception table content from the
  module.h header file. This is achieved with the final commit that
  removes the one line back compatible change that sourced extable.h
  into the module.h file.

  The commits are unchanged since January, with the exception of a
  couple Acks that came in for the last two commits a bit later. The
  changes have been in linux-next for quite some time[1] and have got
  widespread arch coverage via toolchains I have and also from
  additional ones the kbuild bot has.

  Maintaners of the various arch were Cc'd during the postings to
  lkml[2] and informed that the intention was to take the remaining arch
  specific changes and lump them together with the final two non-arch
  specific changes and submit for this merge window.

  The ia64 diffstat stands out and probably warrants a mention. In an
  earlier review, Al Viro made a valid comment that the original header
  separation of content left something to be desired, and that it get
  fixed as a part of this change, hence the larger diffstat"

* tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (21 commits)
  module.h: remove extable.h include now users have migrated
  core: migrate exception table users off module.h and onto extable.h
  cris: migrate exception table users off module.h and onto extable.h
  hexagon: migrate exception table users off module.h and onto extable.h
  microblaze: migrate exception table users off module.h and onto extable.h
  unicore32: migrate exception table users off module.h and onto extable.h
  score: migrate exception table users off module.h and onto extable.h
  metag: migrate exception table users off module.h and onto extable.h
  arc: migrate exception table users off module.h and onto extable.h
  nios2: migrate exception table users off module.h and onto extable.h
  sparc: migrate exception table users onto extable.h
  openrisc: migrate exception table users off module.h and onto extable.h
  frv: migrate exception table users off module.h and onto extable.h
  sh: migrate exception table users off module.h and onto extable.h
  xtensa: migrate exception table users off module.h and onto extable.h
  mn10300: migrate exception table users off module.h and onto extable.h
  alpha: migrate exception table users off module.h and onto extable.h
  arm: migrate exception table users off module.h and onto extable.h
  m32r: migrate exception table users off module.h and onto extable.h
  ia64: ensure exception table search users include extable.h
  ...
2017-02-21 14:28:55 -08:00
Linus Torvalds 42e1b14b6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement wraparound-safe refcount_t and kref_t types based on
     generic atomic primitives (Peter Zijlstra)

   - Improve and fix the ww_mutex code (Nicolai Hähnle)

   - Add self-tests to the ww_mutex code (Chris Wilson)

   - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
     Bueso)

   - Micro-optimize the current-task logic all around the core kernel
     (Davidlohr Bueso)

   - Tidy up after recent optimizations: remove stale code and APIs,
     clean up the code (Waiman Long)

   - ... plus misc fixes, updates and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  fork: Fix task_struct alignment
  locking/spinlock/debug: Remove spinlock lockup detection code
  lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
  lkdtm: Convert to refcount_t testing
  kref: Implement 'struct kref' using refcount_t
  refcount_t: Introduce a special purpose refcount type
  sched/wake_q: Clarify queue reinit comment
  sched/wait, rcuwait: Fix typo in comment
  locking/mutex: Fix lockdep_assert_held() fail
  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
  locking/rwsem: Reinit wake_q after use
  locking/rwsem: Remove unnecessary atomic_long_t casts
  jump_labels: Move header guard #endif down where it belongs
  locking/atomic, kref: Implement kref_put_lock()
  locking/ww_mutex: Turn off __must_check for now
  locking/atomic, kref: Avoid more abuse
  locking/atomic, kref: Use kref_get_unless_zero() more
  locking/atomic, kref: Kill kref_sub()
  locking/atomic, kref: Add kref_read()
  locking/atomic, kref: Add KREF_INIT()
  ...
2017-02-20 13:23:30 -08:00
Linus Torvalds 828cad8ea0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this (fairly busy) cycle were:

   - There was a class of scheduler bugs related to forgetting to update
     the rq-clock timestamp which can cause weird and hard to debug
     problems, so there's a new debug facility for this: which uncovered
     a whole lot of bugs which convinced us that we want to keep the
     debug facility.

     (Peter Zijlstra, Matt Fleming)

   - Various cputime related updates: eliminate cputime and use u64
     nanoseconds directly, simplify and improve the arch interfaces,
     implement delayed accounting more widely, etc. - (Frederic
     Weisbecker)

   - Move code around for better structure plus cleanups (Ingo Molnar)

   - Move IO schedule accounting deeper into the scheduler plus related
     changes to improve the situation (Tejun Heo)

   - ... plus a round of sched/rt and sched/deadline fixes, plus other
     fixes, updats and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
  sched/core: Remove unlikely() annotation from sched_move_task()
  sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
  sched/topology: Split out scheduler topology code from core.c into topology.c
  sched/core: Remove unnecessary #include headers
  sched/rq_clock: Consolidate the ordering of the rq_clock methods
  delayacct: Include <uapi/linux/taskstats.h>
  sched/core: Clean up comments
  sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
  sched/clock: Add dummy clear_sched_clock_stable() stub function
  sched/cputime: Remove generic asm headers
  sched/cputime: Remove unused nsec_to_cputime()
  s390, sched/cputime: Remove unused cputime definitions
  powerpc, sched/cputime: Remove unused cputime definitions
  s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
  ia64, sched/cputime: Remove unused cputime definitions
  ia64: Convert vtime to use nsec units directly
  ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
  sched/cputime: Remove jiffies based cputime
  sched/cputime, vtime: Return nsecs instead of cputime_t to account
  sched/cputime: Complete nsec conversion of tick based accounting
  ...
2017-02-20 12:52:55 -08:00
Linus Torvalds 32e2d7c8af Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Changes to the EFI init code to establish whether secure boot
     authentication was performed at boot time. (Josh Boyer, David
     Howells)

   - Wire up the UEFI memory attributes table for x86. This eliminates
     any runtime memory regions that are both writable and executable,
     on recent firmware versions. (Sai Praneeth)

   - Move the BGRT init code to an earlier stage so that we can still
     use efi_mem_reserve(). (Dave Young)

   - Preserve debug symbols in the ARM/arm64 UEFI stub (Ard Biesheuvel)

   - Code deduplication work and various other cleanups (Lukas Wunner)

   - ... plus various other fixes and cleanups"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/libstub: Make file I/O chunking x86-specific
  efi: Print the secure boot status in x86 setup_arch()
  efi: Disable secure boot if shim is in insecure mode
  efi: Get and store the secure boot status
  efi: Add SHIM and image security database GUID definitions
  arm/efi: Allow invocation of arbitrary runtime services
  x86/efi: Allow invocation of arbitrary runtime services
  efi/libstub: Preserve .debug sections after absolute relocation check
  efi/x86: Add debug code to print cooked memmap
  efi/x86: Move the EFI BGRT init code to early init code
  efi: Use typed function pointers for the runtime services table
  efi/esrt: Fix typo in pr_err() message
  x86/efi: Add support for EFI_MEMORY_ATTRIBUTES_TABLE
  efi: Introduce the EFI_MEM_ATTR bit and set it from the memory attributes table
  efi: Make EFI_MEMORY_ATTRIBUTES_TABLE initialization common across all architectures
  x86/efi: Deduplicate efi_char16_printk()
  efi: Deduplicate efi_file_size() / _read() / _close()
2017-02-20 11:47:11 -08:00
Linus Torvalds f7458a5d63 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The RCU changes in this cycle are:

   - Dynticks updates, consolidating open-coded counter accesses into a
     well-defined API

   - SRCU updates: Simplify algorithm, add formal verification

   - Documentation updates

   - Miscellaneous fixes

   - Torture-test updates

  Most of the diffstat comes from the relatively large documentation
  update"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  srcu: Reduce probability of SRCU ->unlock_count[] counter overflow
  rcutorture: Add CBMC-based formal verification for SRCU
  srcu: Force full grace-period ordering
  srcu: Implement more-efficient reader counts
  rcu: Adjust FQS offline checks for exact online-CPU detection
  rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead
  rcu: Abstract extended quiescent state determination
  rcu: Abstract dynticks extended quiescent state enter/exit operations
  rcu: Add lockdep checks to synchronous expedited primitives
  rcu: Eliminate unused expedited_normal counter
  llist: Clarify comments about when locking is needed
  rcu: Fix comment in rcu_organize_nocb_kthreads()
  rcu: Enable RCU tracepoints by default to aid in debugging
  rcu: Make rcu_cpu_starting() use its "cpu" argument
  rcu: Add comment headers to expedited-grace-period counter functions
  rcu: Don't wake rcuc/X kthreads on NOCB CPUs
  rcu: Re-enable TASKS_RCU for User Mode Linux
  rcu: Once again use NMI-based stack traces in stall warnings
  rcu: Remove short-term CPU kicking
  rcu: Add long-term CPU kicking
  ...
2017-02-20 11:21:17 -08:00
Matthew Wilcox 0a835c4f09 Reimplement IDR and IDA using the radix tree
The IDR is very similar to the radix tree.  It has some functionality that
the radix tree did not have (alloc next free, cyclic allocation, a
callback-based for_each, destroy tree), which is readily implementable on
top of the radix tree.  A few small changes were needed in order to use a
tag to represent nodes with free space below them.  More extensive
changes were needed to support storing NULL as a valid entry in an IDR.
Plain radix trees still interpret NULL as a not-present entry.

The IDA is reimplemented as a client of the newly enhanced radix tree.  As
in the current implementation, it uses a bitmap at the last level of the
tree.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2017-02-13 21:44:01 -05:00
Paul Gortmaker 8a293be0d6 core: migrate exception table users off module.h and onto extable.h
These files were including module.h for exception table related
functions.  We've now separated that content out into its own file
"extable.h" so now move over to that and where possible, avoid all
the extra header content in module.h that we don't really need to
compile these non-modular files.

Note:
   init/main.c still needs module.h for __init_or_module
   kernel/extable.c still needs module.h for is_module_text_address

...and so we don't get the benefit of removing module.h from the cpp
feed for these two files, unlike the almost universal 1:1 exchange
of module.h for extable.h we were able to do in the arch dirs.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-02-09 16:38:53 -05:00
Sergey Senozhatsky f92bac3b14 printk: rename nmi.c and exported api
A preparation patch for printk_safe work. No functional change.
- rename nmi.c to print_safe.c
- add `printk_safe' prefix to some (which used both by printk-safe
  and printk-nmi) of the exported functions.

Link: http://lkml.kernel.org/r/20161227141611.940-3-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 11:02:33 +01:00
Laura Abbott 0f5bf6d0af arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
Both of these options are poorly named. The features they provide are
necessary for system security and should not be considered debug only.
Change the names to CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX to better describe what these options do.

Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-02-07 12:32:52 -08:00
Ingo Molnar 87a8d03266 Linux 4.10-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYl7EJAAoJEHm+PkMAQRiGZ7kIAIQJNCOInU/14K5tjEz9jrKM
 VPWebPm8a1E36C/Nk7mXOW1onOLSHJa7YacmmazbncmtfOANyaUqKVME88+lTWT1
 Pj2ZJyeQE7XpxY0N1eXy0PWZPgPI4ENcYXXERueHTFClXdlK6550obyenw/Tqxhv
 na5Yw66GSXMdNy/kTsK1pp3aJaENYWb2ueYiwr4YNQPUjhs9Y2zKAlMBwHOjuzol
 aGz0482M4cKY6UmMmi8DVVEET4Sp6cQ9YCjtOP+NUOyEjAJ6XG16SejYTSQyjdK7
 w0AAc9F2uCjlNPNy6QvJRO0FFiNhSdaYspPt0IgWa6bpY4m26n1DEBdqKT1uzO4=
 =e20h
 -----END PGP SIGNATURE-----

Merge tag 'v4.10-rc7' into efi/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 08:49:17 +01:00
Greg Kroah-Hartman 17fa87fe5a Merge 4.10-rc7 into char-misc-next
We want the hv and other fixes in here as well to handle merge and
testing issues.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-06 09:39:13 +01:00
Ard Biesheuvel 56067812d5 kbuild: modversions: add infrastructure for emitting relative CRCs
This add the kbuild infrastructure that will allow architectures to emit
vmlinux symbol CRCs as 32-bit offsets to another location in the kernel
where the actual value is stored. This works around problems with CRCs
being mistaken for relocatable symbols on kernels that self relocate at
runtime (i.e., powerpc with CONFIG_RELOCATABLE=y)

For the kbuild side of things, this comes down to the following:

 - introducing a Kconfig symbol MODULE_REL_CRCS

 - adding a -R switch to genksyms to instruct it to emit the CRC symbols
   as references into the .rodata section

 - making modpost distinguish such references from absolute CRC symbols
   by the section index (SHN_ABS)

 - making kallsyms disregard non-absolute symbols with a __crc_ prefix

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 08:28:25 -08:00
Dave Young 7b0a911478 efi/x86: Move the EFI BGRT init code to early init code
Before invoking the arch specific handler, efi_mem_reserve() reserves
the given memory region through memblock.

efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which
time memblock is dead and should not be used anymore.

The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI
table, so move parsing of the BGRT table to ACPI early boot code to
ensure that efi_mem_reserve() in EFI BGRT code still use memblock safely.

Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-acpi@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/1485868902-20401-9-git-send-email-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 08:45:46 +01:00
Ingo Molnar a8709fa4a0 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Dynticks updates, consolidating open-coded counter accesses into a well-defined API

 - SRCU updates: Simplify algorithm, add formal verification

 - Documentation updates

 - Miscellaneous fixes

 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-31 07:45:42 +01:00
Jason A. Donenfeld f5b98461cb random: use chacha20 for get_random_int/long
Now that our crng uses chacha20, we can rely on its speedy
characteristics for replacing MD5, while simultaneously achieving a
higher security guarantee. Before the idea was to use these functions if
you wanted random integers that aren't stupidly insecure but aren't
necessarily secure either, a vague gray zone, that hopefully was "good
enough" for its users. With chacha20, we can strengthen this claim,
since either we're using an rdrand-like instruction, or we're using the
same crng as /dev/urandom. And it's faster than what was before.

We could have chosen to replace this with a SipHash-derived function,
which might be slightly faster, but at the cost of having yet another
RNG construction in the kernel. By moving to chacha20, we have a single
RNG to analyze and verify, and we also already get good performance
improvements on all platforms.

Implementation-wise, rather than use a generic buffer for both
get_random_int/long and memcpy based on the size needs, we use a
specific buffer for 32-bit reads and for 64-bit reads. This way, we're
guaranteed to always have aligned accesses on all platforms. While
slightly more verbose in C, the assembly this generates is a lot
simpler than otherwise.

Finally, on 32-bit platforms where longs and ints are the same size,
we simply alias get_random_int to get_random_long.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-27 14:25:06 -05:00
Paul E. McKenney 1626c365f8 rcu: Re-enable TASKS_RCU for User Mode Linux
Now that User Mode Linux supports arch_irqs_disabled_flags(), this
commit re-enables TASKS_RCU for User Mode Linux.

Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:12 -08:00
William Breathitt Gray ad90a3de9d pc104: Introduce the PC104 Kconfig option
PC/104 form factor devices serve a specific niche of embedded system
users; most Linux users will not have PC/104 form factor devices. This
patch introduces the PC104 Kconfig option, which should be used to
filter PC/104 specific device drivers and options, so that only those
users interested in PC/104 related options are exposed to them.

Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 12:42:25 +01:00
Sebastian Andrzej Siewior 7c6094db59 rcu: update: Make RCU_EXPEDITE_BOOT be the default
RCU_EXPEDITE_BOOT should speed up the boot process by enforcing
synchronize_rcu_expedited() instead of synchronize_rcu() during the boot
process. There should be no reason why one does not want this and there
is no need worry about real time latency at this point.
Therefore make it default.

Note that users wishing to avoid expediting entirely, for example when
bringing up new hardware possibly having flaky IPIs, can use the
rcu_normal boot parameter to override boot-time expediting.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Reworded commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 16:56:39 -08:00
Peter Zijlstra 1e24edca05 locking/atomic, kref: Add KREF_INIT()
Since we need to change the implementation, stop exposing internals.

Provide KREF_INIT() to allow static initialization of struct kref.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Peter Zijlstra 9881b024b7 sched/clock: Delay switching sched_clock to stable
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.

Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:59 +01:00
Arnd Bergmann 73b3514735 cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
not right when CONFIG_NET is disabled and there is no socket interface:

warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

I don't know what the correct solution for this is, but simply removing
the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
'if NET' section avoids the warning and does not produce other build
errors.

Fixes: 483c4933ea ("cgroup: Fix CGROUP_BPF config")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 09:47:10 -05:00
Parav Pandit 39d3e7584a rdmacg: Added rdma cgroup controller
Added rdma cgroup controller that does accounting, limit enforcement
on rdma/IB resources.

Added rdma cgroup header file which defines its APIs to perform
charging/uncharging functionality. It also defined APIs for RDMA/IB
stack for device registration. Devices which are registered will
participate in controller functions of accounting and limit
enforcements. It define rdmacg_device structure to bind IB stack
and RDMA cgroup controller.

RDMA resources are tracked using resource pool. Resource pool is per
device, per cgroup entity which allows setting up accounting limits
on per device basis.

Currently resources are defined by the RDMA cgroup.

Resource pool is created/destroyed dynamically whenever
charging/uncharging occurs respectively and whenever user
configuration is done. Its a tradeoff of memory vs little more code
space that creates resource pool object whenever necessary, instead of
creating them during cgroup creation and device registration time.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-10 11:14:27 -05:00
Nicholas Piggin 6290602709 mm: add PageWaiters indicating tasks are waiting for a page bit
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.

This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.

The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).

This improves the performance of page lock intensive microbenchmarks by
2-3%.

Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-25 11:54:48 -08:00
Linus Torvalds 7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds 52f40e9d65 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes and cleanups from David Miller:

 1) Revert bogus nla_ok() change, from Alexey Dobriyan.

 2) Various bpf validator fixes from Daniel Borkmann.

 3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
    drivers, from Dongpo Li.

 4) Several ethtool ksettings conversions from Philippe Reynes.

 5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert.

 6) XDP support for virtio_net, from John Fastabend.

 7) Fix NAT handling within a vrf, from David Ahern.

 8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
  net: mv643xx_eth: fix build failure
  isdn: Constify some function parameters
  mlxsw: spectrum: Mark split ports as such
  cgroup: Fix CGROUP_BPF config
  qed: fix old-style function definition
  net: ipv6: check route protocol when deleting routes
  r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected
  irda: w83977af_ir: cleanup an indent issue
  net: sfc: use new api ethtool_{get|set}_link_ksettings
  net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
  net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
  net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
  net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
  bpf: fix mark_reg_unknown_value for spilled regs on map value marking
  bpf: fix overflow in prog accounting
  bpf: dynamically allocate digest scratch buffer
  gtp: Fix initialization of Flags octet in GTPv1 header
  gtp: gtp_check_src_ms_ipv4() always return success
  net/x25: use designated initializers
  isdn: use designated initializers
  ...
2016-12-17 20:17:04 -08:00
Andy Lutomirski 483c4933ea cgroup: Fix CGROUP_BPF config
CGROUP_BPF depended on SOCK_CGROUP_DATA which can't be manually
enabled, making it rather challenging to turn CGROUP_BPF on.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17 21:42:45 -05:00
Linus Torvalds 4d98ead183 Modules updates for v4.10
Summary of modules changes for the 4.10 merge window:
 
 * The rodata= cmdline parameter has been extended to additionally
   apply to module mappings
 
 * Fix a hard to hit race between module loader error/clean up
   handling and ftrace registration
 
 * Some code cleanups, notably panic.c and modules code use a
   unified taint_flags table now. This is much cleaner than
   duplicating the taint flag code in modules.c
 
 Signed-off-by: Jessica Yu <jeyu@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYUf6/AAoJEMBFfjjOO8Fy5NoP+gOIus26yWWGymI495jVnX7n
 wCga5JgwOL0SLBIPmiDVI7K+jz4eoQZb94eJcwkWDuw2/IvOdF1kB8ha1EOBRMSg
 nb9HfIDlWiAPKkyUxe+k6XDb+BMPN3FUSYmBAKD3utsQkD1JWBLY8Id4e234y8Fo
 sb3a6rLJbvIEXANrMeU7zO4/y1bVxQAeQPQbVPwlid5s76RKYH6JdGXoo6FKK0uE
 Z3I8uQjqjmJ5U4vpjjWl0w+Qa7hIm/x05GpirtNxN6ztxjR+98c/4uRIry8oOX+I
 KqRXDOnJ1l/rCwhp+pGLwPfCoDds+V3bknyOwYoxK3hqVVUAd8H0qd1JQ8XClwyJ
 jnE0+EQpTt9brOO1Oq2XC+EDjpiuyYm3u91TFwE2VFmP98daBZsX6qY7bm03/GQq
 ZLRthWPILNX9glGj4nbHQgdAKmRvYDO3SzWjFZNA75Mr2hbRKLJoWNvfgupDgjsF
 giawxV/OcWXvEX92fzkwoUszpfWwoDhGsbimG2SCKYB87vNniG7wrgdjp5aWHhOL
 qCUpUhCvE9/dO7kPRinqk5tnpAUGY2jMZ0QgVbpToF6FiHJJSyDjWHR9n0Bl1QTX
 uAEZB/Hoav9frZ+MQC/1Yzhq5ejDbEm1ByjolJgbjl6YHBlQceL6NQpFmyEkrn7c
 Tx+Q/PvG7/gfxFGMirf1
 =bhCS
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:
 "Summary of modules changes for the 4.10 merge window:

   - The rodata= cmdline parameter has been extended to additionally
     apply to module mappings

   - Fix a hard to hit race between module loader error/clean up
     handling and ftrace registration

   - Some code cleanups, notably panic.c and modules code use a unified
     taint_flags table now. This is much cleaner than duplicating the
     taint flag code in modules.c"

* tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: fix DEBUG_SET_MODULE_RONX typo
  module: extend 'rodata=off' boot cmdline parameter to module mappings
  module: Fix a comment above strong_try_module_get()
  module: When modifying a module's text ignore modules which are going away too
  module: Ensure a module's state is set accordingly during module coming cleanup code
  module: remove trailing whitespace
  taint/module: Clean up global and module taint flags handling
  modpost: free allocated memory
2016-12-14 20:12:43 -08:00
Linus Torvalds c11a6cfb01 Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Mostly patches to initialize workqueue subsystem earlier and get rid
  of keventd_up().

  The patches were headed for the last merge cycle but got delayed due
  to a bug found late minute, which is fixed now.

  Also, to help debugging, destroy_workqueue() is more chatty now on a
  sanity check failure."

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: move wq_numa_init() to workqueue_init()
  workqueue: remove keventd_up()
  debugobj, workqueue: remove keventd_up() usage
  slab, workqueue: remove keventd_up() usage
  power, workqueue: remove keventd_up() usage
  tty, workqueue: remove keventd_up() usage
  mce, workqueue: remove keventd_up() usage
  workqueue: make workqueue available early during boot
  workqueue: dump workqueue state on sanity check failures in destroy_workqueue()
2016-12-13 12:59:57 -08:00
Linus Torvalds e7aa8c2eb1 These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
 continues.  Highlights include:
 
  - Further work on PDF output, which remains a bit of a pain but should be
    more solid now.
 
  - Five more DocBook template files converted to Sphinx.  Only 27 to go...
    Lots of plain-text files have also been converted and integrated.
 
  - Images in binary formats have been replaced with more source-friendly
    versions.
 
  - Various bits of organizational work, including the renaming of various
    files discussed at the kernel summit.
 
  - New documentation for the device_link mechanism.
 
 ...and, of course, lots of typo fixes and small updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
 sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
 T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
 XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
 BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
 r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
 QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
 RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
 zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
 A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
 bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
 pmSJR0hVeRUmA4uw6+Su
 =A0EV
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.10' of git://git.lwn.net/linux

Pull documentation update from Jonathan Corbet:
 "These are the documentation changes for 4.10.

  It's another busy cycle for the docs tree, as the sphinx conversion
  continues. Highlights include:

   - Further work on PDF output, which remains a bit of a pain but
     should be more solid now.

   - Five more DocBook template files converted to Sphinx. Only 27 to
     go... Lots of plain-text files have also been converted and
     integrated.

   - Images in binary formats have been replaced with more
     source-friendly versions.

   - Various bits of organizational work, including the renaming of
     various files discussed at the kernel summit.

   - New documentation for the device_link mechanism.

  ... and, of course, lots of typo fixes and small updates"

* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
  dma-buf: Extract dma-buf.rst
  Update Documentation/00-INDEX
  docs: 00-INDEX: document directories/files with no docs
  docs: 00-INDEX: remove non-existing entries
  docs: 00-INDEX: add missing entries for documentation files/dirs
  docs: 00-INDEX: consolidate process/ and admin-guide/ description
  scripts: add a script to check if Documentation/00-INDEX is sane
  Docs: change sh -> awk in REPORTING-BUGS
  Documentation/core-api/device_link: Add initial documentation
  core-api: remove an unexpected unident
  ppc/idle: Add documentation for powersave=off
  Doc: Correct typo, "Introdution" => "Introduction"
  Documentation/atomic_ops.txt: convert to ReST markup
  Documentation/local_ops.txt: convert to ReST markup
  Documentation/assoc_array.txt: convert to ReST markup
  docs-rst: parse-headers.pl: cleanup the documentation
  docs-rst: fix media cleandocs target
  docs-rst: media/Makefile: reorganize the rules
  docs-rst: media: build SVG from graphviz files
  docs-rst: replace bayer.png by a SVG image
  ...
2016-12-12 21:58:13 -08:00
Linus Torvalds e34bac726d Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - most of MM (quite a lot of MM material is awaiting the merge of
   linux-next dependencies)

 - kasan

 - printk updates

 - procfs updates

 - MAINTAINERS

 - /lib updates

 - checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
  init: reduce rootwait polling interval time to 5ms
  binfmt_elf: use vmalloc() for allocation of vma_filesz
  checkpatch: don't emit unified-diff error for rename-only patches
  checkpatch: don't check c99 types like uint8_t under tools
  checkpatch: avoid multiple line dereferences
  checkpatch: don't check .pl files, improve absolute path commit log test
  scripts/checkpatch.pl: fix spelling
  checkpatch: don't try to get maintained status when --no-tree is given
  lib/ida: document locking requirements a bit better
  lib/rbtree.c: fix typo in comment of ____rb_erase_color
  lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
  MAINTAINERS: add drm and drm/i915 irc channels
  MAINTAINERS: add "C:" for URI for chat where developers hang out
  MAINTAINERS: add drm and drm/i915 bug filing info
  MAINTAINERS: add "B:" for URI where to file bugs
  get_maintainer: look for arbitrary letter prefixes in sections
  printk: add Kconfig option to set default console loglevel
  printk/sound: handle more message headers
  printk/btrfs: handle more message headers
  printk/kdb: handle more message headers
  ...
2016-12-12 20:50:02 -08:00
Linus Torvalds 9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Jungseung Lee 39a0e975c3 init: reduce rootwait polling interval time to 5ms
For several devices, the rootwait time is sensitive because it directly
affects booting time.  The polling interval of rootwait is currently
100ms.  To save unnessesary waiting time, reduce the polling interval to
5 ms.

[akpm@linux-foundation.org: remove used-once #define]
Link: http://lkml.kernel.org/r/20161207060743.1728-1-js07.lee@samsung.com
Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:10 -08:00
Linus Torvalds 212f30008a Merge branch 'x86-idle-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 idle updates from Ingo Molnar:
 "There were two bigger changes in this development cycle:

   - remove idle notifiers:

       32 files changed, 74 insertions(+), 803 deletions(-)

     These notifiers were of questionable value and the main usecase,
     the i7300 driver, was essentially unmaintained and can be removed,
     plus modern power management concepts don't need the callback - so
     use this golden opportunity and get rid of this opaque and fragile
     callback from a latency sensitive code path.

     (Len Brown, Thomas Gleixner)

   - improve the AMD Erratum 400 workaround that used high overhead MSR
     polling in the idle loop (Borisla Petkov, Thomas Gleixner)"

* 'x86-idle-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove empty idle.h header
  x86/amd: Simplify AMD E400 aware idle routine
  x86/amd: Check for the C1E bug post ACPI subsystem init
  x86/bugs: Separate AMD E400 erratum and C1E bug
  x86/cpufeature: Provide helper to set bugs bits
  x86/idle: Remove enter_idle(), exit_idle()
  x86: Remove x86_test_and_clear_bit_percpu()
  x86/idle: Remove is_idle flag
  x86/idle: Remove idle_notifier
  i7300_idle: Remove this driver
2016-12-12 14:55:04 -08:00
Thomas Gleixner e7ff3a4763 x86/amd: Check for the C1E bug post ACPI subsystem init
AMD CPUs affected by the E400 erratum suffer from the issue that the
local APIC timer stops when the CPU goes into C1E. Unfortunately there
is no way to detect the affected CPUs on early boot. It's only possible
to determine the range of possibly affected CPUs from the family/model
range.

The actual decision whether to enter C1E and thus cause the bug is done
by the firmware and we need to detect that case late, after ACPI has
been initialized.

The current solution is to check in the idle routine whether the CPU is
affected by reading the MSR_K8_INT_PENDING_MSG MSR and checking for the
K8_INTP_C1E_ACTIVE_MASK bits. If one of the bits is set then the CPU is
affected and the system is switched into forced broadcast mode.

This is ineffective and on non-affected CPUs every entry to idle does
the extra RDMSR.

After doing some research it turns out that the bits are visible on the
boot CPU right after the ACPI subsystem is initialized in the early
boot process. So instead of polling for the bits in the idle loop, add
a detection function after acpi_subsystem_init() and check for the MSR
bits. If set, then the X86_BUG_AMD_APIC_C1E is set on the boot CPU and
the TSC is marked unstable when X86_FEATURE_NONSTOP_TSC is not set as it
will stop in C1E state as well.

The switch to broadcast mode cannot be done at this point because the
boot CPU still uses HPET as a clockevent device and the local APIC timer
is not yet calibrated and installed. The switch to broadcast mode on the
affected CPUs needs to be done when the local APIC timer is actually set
up.

This allows to cleanup the amd_e400_idle() function in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-4-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 21:23:21 +01:00
David S. Miller 2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Linus Torvalds faaae2a581 Re-enable CONFIG_MODVERSIONS in a slightly weaker form
This enables CONFIG_MODVERSIONS again, but allows for missing symbol CRC
information in order to work around the issue that newer binutils
versions seem to occasionally drop the CRC on the floor.  binutils 2.26
seems to work fine, while binutils 2.27 seems to break MODVERSIONS of
symbols that have been defined in assembler files.

[ We've had random missing CRC's before - it may be an old problem that
  just is now reliably triggered with the weak asm symbols and a new
  version of binutils ]

Some day I really do want to remove MODVERSIONS entirely.  Sadly, today
does not appear to be that day: Debian people apparently do want the
option to enable MODVERSIONS to make it easier to have external modules
across kernel versions, and this seems to be a fairly minimal fix for
the annoying problem.

Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-29 16:01:30 -08:00
Arnd Bergmann 4d217a5adc module: fix DEBUG_SET_MODULE_RONX typo
The newly added 'rodata_enabled' global variable is protected by
the wrong #ifdef, leading to a link error when CONFIG_DEBUG_SET_MODULE_RONX
is turned on:

kernel/module.o: In function `disable_ro_nx':
module.c:(.text.unlikely.disable_ro_nx+0x88): undefined reference to `rodata_enabled'
kernel/module.o: In function `module_disable_ro':
module.c:(.text.module_disable_ro+0x8c): undefined reference to `rodata_enabled'
kernel/module.o: In function `module_enable_ro':
module.c:(.text.module_enable_ro+0xb0): undefined reference to `rodata_enabled'

CONFIG_SET_MODULE_RONX does not exist, so use the correct one instead.

Fixes: 39290b389e ("module: extend 'rodata=off' boot cmdline parameter to module mappings")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-28 11:37:57 -08:00
AKASHI Takahiro 39290b389e module: extend 'rodata=off' boot cmdline parameter to module mappings
The current "rodata=off" parameter disables read-only kernel mappings
under CONFIG_DEBUG_RODATA:
    commit d2aa1acad2 ("mm/init: Add 'rodata=off' boot cmdline parameter
    to disable read-only kernel mappings")

This patch is a logical extension to module mappings ie. read-only mappings
at module loading can be disabled even if CONFIG_DEBUG_SET_MODULE_RONX
(mainly for debug use). Please note, however, that it only affects RO/RW
permissions, keeping NX set.

This is the first step to make CONFIG_DEBUG_SET_MODULE_RONX mandatory
(always-on) in the future as CONFIG_DEBUG_RODATA on x86 and arm64.

Suggested-by: and Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Link: http://lkml.kernel.org/r/20161114061505.15238-1-takahiro.akashi@linaro.org
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-27 16:15:33 -08:00
David S. Miller 0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Linus Torvalds cd3caefb46 Fix subtle CONFIG_MODVERSIONS problems
CONFIG_MODVERSIONS has been broken for pretty much the whole 4.9 series,
and quite frankly, nobody has cared very deeply.  We absolutely know how
to fix it, and it's not _complicated_, but it's not exactly pretty
either.

This oneliner fixes it without the ugliness, and allows for further
future cleanups.

  "We've secretly replaced their regular MODVERSIONS with nothing at
   all, let's see if they notice"

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-25 15:44:47 -08:00
Daniel Mack 3007098494 cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:25:52 -05:00
Nicolas Schichan 18594e9bc4 init: use pr_cont() when displaying rotator during ramdisk loading.
Otherwise each individual rotator char would be printed in a new line:

(...)
[    0.642350] -
[    0.644374] |
[    0.646367] -
(...)

Signed-off-by: Nicolas Schichan <nicolas.schichan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-24 09:32:20 -08:00
Nicolas Pitre baa73d9e47 posix-timers: Make them configurable
Some embedded systems have no use for them.  This removes about
25KB from the kernel binary size when configured out.

Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:35 +01:00
Mauro Carvalho Chehab 8c27ceff36 docs: fix locations of several documents that got moved
The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2016-10-24 08:12:35 -02:00
Tejun Heo 8bc4a04455 Merge branch 'for-4.9' into for-4.10 2016-10-19 12:12:40 -04:00
Linus Torvalds 9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Linus Torvalds 84d69848c9 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:

 - EXPORT_SYMBOL for asm source by Al Viro.

   This does bring a regression, because genksyms no longer generates
   checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
   working on a patch to fix this.

   Plus, we are talking about functions like strcpy(), which rarely
   change prototypes.

 - Fixes for PPC fallout of the above by Stephen Rothwell and Nick
   Piggin

 - fixdep speedup by Alexey Dobriyan.

 - preparatory work by Nick Piggin to allow architectures to build with
   -ffunction-sections, -fdata-sections and --gc-sections

 - CONFIG_THIN_ARCHIVES support by Stephen Rothwell

 - fix for filenames with colons in the initramfs source by me.

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
  initramfs: Escape colons in depfile
  ppc: there is no clear_pages to export
  powerpc/64: whitelist unresolved modversions CRCs
  kbuild: -ffunction-sections fix for archs with conflicting sections
  kbuild: add arch specific post-link Makefile
  kbuild: allow archs to select link dead code/data elimination
  kbuild: allow architectures to use thin archives instead of ld -r
  kbuild: Regenerate genksyms lexer
  kbuild: genksyms fix for typeof handling
  fixdep: faster CONFIG_ search
  ia64: move exports to definitions
  sparc32: debride memcpy.S a bit
  [sparc] unify 32bit and 64bit string.h
  sparc: move exports to definitions
  ppc: move exports to definitions
  arm: move exports to definitions
  s390: move exports to definitions
  m68k: move exports to definitions
  alpha: move exports to actual definitions
  x86: move exports to actual definitions
  ...
2016-10-14 14:26:58 -07:00
Peter Zijlstra 26b5679e43 relay: Use irq_work instead of plain timer for deferred wakeup
Relay avoids calling wake_up_interruptible() for doing the wakeup of
readers/consumers, waiting for the generation of new data, from the
context of a process which produced the data.  This is apparently done to
prevent the possibility of a deadlock in case Scheduler itself is is
generating data for the relay, after acquiring rq->lock.

The following patch used a timer (to be scheduled at next jiffy), for
delegating the wakeup to another context.
	commit 7c9cb38302
	Author: Tom Zanussi <zanussi@comcast.net>
	Date:   Wed May 9 02:34:01 2007 -0700

	relay: use plain timer instead of delayed work

	relay doesn't need to use schedule_delayed_work() for waking readers
	when a simple timer will do.

Scheduling a plain timer, at next jiffies boundary, to do the wakeup
causes a significant wakeup latency for the Userspace client, which makes
relay less suitable for the high-frequency low-payload use cases where the
data gets generated at a very high rate, like multiple sub buffers getting
filled within a milli second.  Moreover the timer is re-scheduled on every
newly produced sub buffer so the timer keeps getting pushed out if sub
buffers are filled in a very quick succession (less than a jiffy gap
between filling of 2 sub buffers).  As a result relay runs out of sub
buffers to store the new data.

By using irq_work it is ensured that wakeup of userspace client, blocked
in the poll call, is done at earliest (through self IPI or next timer
tick) enabling it to always consume the data in time.  Also this makes
relay consistent with printk & ring buffers (trace), as they too use
irq_work for deferred wake up of readers.

[arnd@arndb.de: select CONFIG_IRQ_WORK]
 Link: http://lkml.kernel.org/r/20160912154035.3222156-1-arnd@arndb.de
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1472906487-1559-1-git-send-email-akash.goel@intel.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Akash Goel <akash.goel@intel.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Emese Revfy 38addce8b6 gcc-plugins: Add latent_entropy plugin
This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).

At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.

The need for very-early boot entropy tends to be very architecture or
system design specific, so this plugin is more suited for those sorts
of special cases. The existing kernel RNG already attempts to extract
entropy from reliable runtime variation, but this plugin takes the idea to
a logical extreme by permuting a global variable based on any variation
in code execution (e.g. a different value (and permutation function)
is used to permute the global based on loop count, case statement,
if/then/else branching, etc).

To do this, the plugin starts by inserting a local variable in every
marked function. The plugin then adds logic so that the value of this
variable is modified by randomly chosen operations (add, xor and rol) and
random values (gcc generates separate static values for each location at
compile time and also injects the stack pointer at runtime). The resulting
value depends on the control flow path (e.g., loops and branches taken).

Before the function returns, the plugin mixes this local variable into
the latent_entropy global variable. The value of this global variable
is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
though it does not credit any bytes of entropy to the pool; the contents
of the global are just used to mix the pool.

Additionally, the plugin can pre-initialize arrays with build-time
random contents, so that two different kernel builds running on identical
hardware will not have the same starting values.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message and code comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:44 -07:00
Linus Torvalds 997b611baf Merge branch 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
 "Changes include:

   - Fix boot of 32bit SMP kernel (initial kernel mapping was too small)

   - Added hardened usercopy checks

   - Drop bootmem and switch to memblock and NO_BOOTMEM implementation

   - Drop the BROKEN_RODATA config option (and thus remove the relevant
     code from the generic headers and files because parisc was the last
     architecture which used this config option)

   - Improve segfault reporting by printing human readable error strings

   - Various smaller changes, e.g. dwarf debug support for assembly
     code, update comments regarding copy_user_page_asm, switch to
     kmalloc_array()"

* 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels
  parisc: Drop bootmem and switch to memblock
  parisc: Add hardened usercopy feature
  parisc: Add cfi_startproc and cfi_endproc to assembly code
  parisc: Move hpmc stack into page aligned bss section
  parisc: Fix self-detected CPU stall warnings on Mako machines
  parisc: Report trap type as human readable string
  parisc: Update comment regarding implementation of copy_user_page_asm
  parisc: Use kmalloc_array() in add_system_map_addresses()
  parisc: Check return value of smp_boot_one_cpu()
  parisc: Drop BROKEN_RODATA config option
2016-10-07 20:50:37 -07:00
Helge Deller b5d5cf2b8a parisc: Drop BROKEN_RODATA config option
PARISC was the only architecture which selected the BROKEN_RODATA config
option. Drop it and remove the special handling from init.h as well.

Signed-off-by: Helge Deller <deller@gmx.de>
2016-09-20 18:02:35 +02:00
Tejun Heo 3347fa0928 workqueue: make workqueue available early during boot
Workqueue is currently initialized in an early init call; however,
there are cases where early boot code has to be split and reordered to
come after workqueue initialization or the same code path which makes
use of workqueues is used both before workqueue initailization and
after.  The latter cases have to gate workqueue usages with
keventd_up() tests, which is nasty and easy to get wrong.

Workqueue usages have become widespread and it'd be a lot more
convenient if it can be used very early from boot.  This patch splits
workqueue initialization into two steps.  workqueue_init_early() which
sets up the basic data structures so that workqueues can be created
and work items queued, and workqueue_init() which actually brings up
workqueues online and starts executing queued work items.  The former
step can be done very early during boot once memory allocation,
cpumasks and idr are initialized.  The latter right after kthreads
become available.

This allows work item queueing and canceling from very early boot
which is what most of these use cases want.

* As systemd_wq being initialized doesn't indicate that workqueue is
  fully online anymore, update keventd_up() to test wq_online instead.
  The follow-up patches will get rid of all its usages and the
  function itself.

* Flushing doesn't make sense before workqueue is fully initialized.
  The flush functions trigger WARN and return immediately before fully
  online.

* Work items are never in-flight before fully online.  Canceling can
  always succeed by skipping the flush step.

* Some code paths can no longer assume to be called with irq enabled
  as irq is disabled during early boot.  Use irqsave/restore
  operations instead.

v2: Watchdog init, which requires timer to be running, moved from
    workqueue_init_early() to workqueue_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
2016-09-17 13:18:21 -04:00
Andy Lutomirski c6c314a613 sched/core: Add try_get_task_stack() and put_task_stack()
There are a few places in the kernel that access stack memory
belonging to a different task.  Before we can start freeing task
stacks before the task_struct is freed, we need a way for those code
paths to pin the stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/17a434f50ad3d77000104f21666575e10a9c1fbd.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:18:53 +02:00
Andy Lutomirski c65eacbe29 sched/core: Allow putting thread_info into task_struct
If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct.  thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.

This is heavily based on a patch written by Linus.

Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:25:13 +02:00
Nicholas Piggin b67067f117 kbuild: allow archs to select link dead code/data elimination
Introduce LD_DEAD_CODE_DATA_ELIMINATION option for architectures to
select to build with -ffunction-sections, -fdata-sections, and link
with --gc-sections. It requires some work (documented) to ensure all
unreferenced entrypoints are live, and requires toolchain and build
verification, so it is made a per-arch option for now.

On a random powerpc64le build, this yelds a significant size saving,
it boots and runs fine, but there is a lot I haven't tested as yet, so
these savings may be reduced if there are bugs in the link.

    text      data        bss        dec   filename
11169741   1180744    1923176	14273661   vmlinux
10445269   1004127    1919707	13369103   vmlinux.dce

~700K text, ~170K data, 6% removed from kernel image size.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-09-09 10:47:00 +02:00
Linus Torvalds 1eccfa090e Implements HARDENED_USERCOPY verification of copy_to_user/copy_from_user
bounds checking for most architectures on SLAB and SLUB.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJXl9tlAAoJEIly9N/cbcAm5BoP/ikTtDp2bFw1sn92yHTnIWzl
 O+dcKVAeRgjfnSvPfb1JITpaM58exQSaDsPBeR0DbVzU1zDdhLcwHHiQupFh98Ka
 vBZthbrlL/u4NB26enEEW0iyA32BsxYBMnIu0z5ux9RbZflmQwGQ0c0rvy3dJ7/b
 FzB5ayVST5y/a0m6/sImeeExh78GU9rsMb1XmJRMwlJAy6miDz/F9TP0LnuW6PhG
 J5XC99ygNJS1pQBLACRsrZw6ImgBxXnWCok6tWPMxFfD+rJBU2//wqS+HozyMWHL
 iYP7+ytVo/ZVok4114X/V4Oof3a6wqgpBuYrivJ228QO+UsLYbYLo6sZ8kRK7VFm
 9GgHo/8rWB1T9lBbSaa7UL5r0dVNNLjFGS42vwV+YlgUMQ1A35VRojO0jUnJSIQU
 Ug1IxKmylLd0nEcwD8/l3DXeQABsfL8GsoKW0OtdTZtW4RND4gzq34LK6t7hvayF
 kUkLg1OLNdUJwOi16M/rhugwYFZIMfoxQtjkRXKWN4RZ2QgSHnx2lhqNmRGPAXBG
 uy21wlzUTfLTqTpoeOyHzJwyF2qf2y4nsziBMhvmlrUvIzW1LIrYUKCNT4HR8Sh5
 lC2WMGYuIqaiu+NOF3v6CgvKd9UW+mxMRyPEybH8mEgfm+FLZlWABiBjIUpSEZuB
 JFfuMv1zlljj/okIQRg8
 =USIR
 -----END PGP SIGNATURE-----

Merge tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull usercopy protection from Kees Cook:
 "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
  copy_from_user bounds checking for most architectures on SLAB and
  SLUB"

* tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  mm: SLUB hardened usercopy support
  mm: SLAB hardened usercopy support
  s390/uaccess: Enable hardened usercopy
  sparc/uaccess: Enable hardened usercopy
  powerpc/uaccess: Enable hardened usercopy
  ia64/uaccess: Enable hardened usercopy
  arm64/uaccess: Enable hardened usercopy
  ARM: uaccess: Enable hardened usercopy
  x86/uaccess: Enable hardened usercopy
  mm: Hardened usercopy
  mm: Implement stack frame object validation
  mm: Add is_migrate_cma_page
2016-08-08 14:48:14 -07:00
Valdis Kletnieks f1cb637e75 init/Kconfig: add clarification for out-of-tree modules
It doesn't trim just symbols that are totally unused in-tree - it trims
the symbols unused by any in-tree modules actually built.  If you've
done a 'make localmodconfig' and only build a hundred or so modules,
it's pretty likely that your out-of-tree module will come up lacking
something...

Hopefully this will save the next guy from a Homer Simpson "D'oh!"
moment.

Link: http://lkml.kernel.org/r/10177.1469787292@turing-police.cc.vt.edu
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:43 -04:00
Alexey Dobriyan ac3339baff init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
Doing patches with allmodconfig kernel compiled and committing stuff
into local tree have unfortunate consequence: kernel version changes (as
it should) leading to recompiling and relinking of several files even if
they weren't touched (or interesting at all).  This and "git-whatever"
figuring out current version slow down compilation for no good reason.

But lets face it, "allmodconfig" kernels don't care about kernel
version, they are simply compile check guinea pigs.

Make LOCALVERSION_AUTO depend on !COMPILE_TEST, so it doesn't sneak into
allmodconfig .config.

Link: http://lkml.kernel.org/r/20160707214954.GC31678@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:41 -04:00
Prarit Bhargava 841c06d71e init: allow blacklisting of module_init functions
sprint_symbol_no_offset() returns the string "function_name
[module_name]" where [module_name] is not printed for built in kernel
functions.  This means that the blacklisting code will fail when
comparing module function names with the extended string.

This patch adds the functionality to block a module's module_init()
function by finding the space in the string and truncating the
comparison to that length.

Link: http://lkml.kernel.org/r/1466124387-20446-1-git-send-email-prarit@redhat.com
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:40 -04:00
Fabian Frederick bd721ea73e treewide: replace obsolete _refok by __ref
There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Richard Weinberger bc083a64b6 init/Kconfig: make COMPILE_TEST depend on !UML
UML is a bit special since it does not have iomem nor dma.  That means a
lot of drivers will not build if they miss a dependency on HAS_IOMEM.
s390 used to have the same issues but since it gained PCI support UML is
the only stranger.

We are tired of patching dozens of new drivers after every merge window
just to un-break allmod/yesconfig UML builds.  One could argue that a
decent driver has to know on what it depends and therefore a missing
HAS_IOMEM dependency is a clear driver bug.  But the dependency not
obvious and not everyone does UML builds with COMPILE_TEST enabled when
developing a device driver.

A possible solution to make these builds succeed on UML would be
providing stub functions for ioremap() and friends which fail upon
runtime.  Another one is simply disabling COMPILE_TEST for UML.  Since
it is the least hassle and does not force use to fake iomem support
let's do the latter.

Link: http://lkml.kernel.org/r/1466152995-28367-1-git-send-email-richard@nod.at
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
seokhoon.yoon 9991a9c8db cgroup: update cgroup's document path
cgroup's document path is changed to "cgroup-v1".  update it.

Link: http://lkml.kernel.org/r/1470148443-6509-1-git-send-email-iamyooon@gmail.com
Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Linus Torvalds 69c4289449 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  fat: fix error message for bogus number of directory entries
  fat: fix typo s/supeblock/superblock/
  ASoC: max9877: Remove unused function declaration
  dw2102: don't output spurious blank lines to the kernel log
  init: fix Kconfig text
  ARM: io: fix comment grammar
  ocfs: fix ocfs2_xattr_user_get() argument name
  scsi/qla2xxx: Remove erroneous unused macro qla82xx_get_temp_val1()
2016-07-28 14:22:25 -07:00
Thomas Garnier 210e7a43fa mm: SLUB freelist randomization
Implements freelist randomization for the SLUB allocator.  It was
previous implemented for the SLAB allocator.  Both use the same
configuration option (CONFIG_SLAB_FREELIST_RANDOM).

The list is randomized during initialization of a new set of pages.  The
order on different freelist sizes is pre-computed at boot for
performance.  Each kmem_cache has its own randomized freelist.

This security feature reduces the predictability of the kernel SLUB
allocator against heap overflows rendering attacks much less stable.

For example these attacks exploit the predictability of the heap:
 - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
 - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

Performance results:

slab_test impact is between 3% to 4% on average for 100000 attempts
without smp.  It is a very focused testing, kernbench show the overall
impact on the system is way lower.

Before:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
  100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
  100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
  100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
  100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
  100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
  100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
  100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
  100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
  100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 70 cycles
  100000 times kmalloc(16)/kfree -> 70 cycles
  100000 times kmalloc(32)/kfree -> 70 cycles
  100000 times kmalloc(64)/kfree -> 70 cycles
  100000 times kmalloc(128)/kfree -> 70 cycles
  100000 times kmalloc(256)/kfree -> 69 cycles
  100000 times kmalloc(512)/kfree -> 70 cycles
  100000 times kmalloc(1024)/kfree -> 73 cycles
  100000 times kmalloc(2048)/kfree -> 72 cycles
  100000 times kmalloc(4096)/kfree -> 71 cycles

After:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
  100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
  100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
  100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
  100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
  100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
  100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
  100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
  100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
  100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 66 cycles
  100000 times kmalloc(16)/kfree -> 66 cycles
  100000 times kmalloc(32)/kfree -> 66 cycles
  100000 times kmalloc(64)/kfree -> 66 cycles
  100000 times kmalloc(128)/kfree -> 65 cycles
  100000 times kmalloc(256)/kfree -> 67 cycles
  100000 times kmalloc(512)/kfree -> 67 cycles
  100000 times kmalloc(1024)/kfree -> 64 cycles
  100000 times kmalloc(2048)/kfree -> 67 cycles
  100000 times kmalloc(4096)/kfree -> 67 cycles

Kernbench, before:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 101.873 (1.16069)
  User Time 1045.22 (1.60447)
  System Time 88.969 (0.559195)
  Percent CPU 1112.9 (13.8279)
  Context Switches 189140 (2282.15)
  Sleeps 99008.6 (768.091)

After:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 102.47 (0.562732)
  User Time 1045.3 (1.34263)
  System Time 88.311 (0.342554)
  Percent CPU 1105.8 (6.49444)
  Context Switches 189081 (2355.78)
  Sleeps 99231.5 (800.358)

Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kees Cook ed18adc1cd mm: SLUB hardened usercopy support
Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
SLUB allocator to catch any copies that may span objects. Includes a
redzone handling fix discovered by Michael Ellerman.

Based on code from PaX and grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Reviwed-by: Laura Abbott <labbott@redhat.com>
2016-07-26 14:43:54 -07:00
Kees Cook 04385fc5e8 mm: SLAB hardened usercopy support
Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
SLAB allocator to catch any copies that may span objects.

Based on code from PaX and grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
2016-07-26 14:41:53 -07:00
Linus Torvalds 766fd5f6cd Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Ingo Molnar:

 - fix system/idle cputime leaked on cputime accounting (all nohz
   configs) (Rik van Riel)

 - remove the messy, ad-hoc irqtime account on nohz-full and make it
   compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

 - cleanups (Frederic Weisbecker)

 - remove unecessary irq disablement in the irqtime code (Rik van Riel)

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
  sched/cputime: Reorganize vtime native irqtime accounting headers
  sched/cputime: Clean up the old vtime gen irqtime accounting completely
  sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
  sched/cputime: Count actually elapsed irq & softirq time
2016-07-25 14:43:00 -07:00
Linus Torvalds df00ccca72 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - documentation updates

   - miscellaneous fixes

   - minor reorganization of code

   - torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  rcu: Correctly handle sparse possible cpus
  rcu: sysctl: Panic on RCU Stall
  rcu: Fix a typo in a comment
  rcu: Make call_rcu_tasks() tolerate first call with irqs disabled
  rcu: Disable TASKS_RCU for usermode Linux
  rcu: No ordering for rcu_assign_pointer() of NULL
  rcutorture: Fix error return code in rcu_perf_init()
  torture: Inflict default jitter
  rcuperf: Don't treat gp_exp mis-setting as a WARN
  rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments
  rcutorture: Don't specify the cpu type of QEMU on PPC
  rcutorture: Make -soundhw a x86 specific option
  rcutorture: Use vmlinux as the fallback kernel image
  rcutorture/doc: Create initrd using dracut
  torture: Stop onoff task if there is only one cpu
  torture: Add starvation events to error summary
  torture:  Break online and offline functions out of torture_onoff()
  torture: Forgive lengthy trace dumps and preemption
  torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
  torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE
  ...
2016-07-25 12:04:11 -07:00
Rik van Riel b58c358405 sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
appear to currently work right.

On CPUs without nohz_full=, only tick based irq time sampling is
done, which breaks down when dealing with a nohz_idle CPU.

On firewalls and similar systems, no ticks may happen on a CPU for a
while, and the irq time spent may never get accounted properly. This
can cause issues with capacity planning and power saving, which use
the CPU statistics as inputs in decision making.

Remove the VTIME_GEN vtime irq time code, and replace it with the
IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:34 +02:00
Randy Dunlap 076501ff6b init/Kconfig: keep Expert users menu together
The "expert" menu was broken (split) such that all entries in it after
KALLSYMS were displayed in the "General setup" area instead of in the
"Expert users" area.  Fix this by adding one kconfig dependency.

Yes, the Expert users menu is fragile.  Problems like this have happened
several times in the past.  I will attempt to isolate the Expert users
menu if there is interest in that.

Fixes: 4d5d5664c9 ("x86: kallsyms: disable absolute percpu symbols on !SMP")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: stable@vger.kernel.org  # 4.6
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-06 16:27:20 -07:00
Ingo Molnar 54d5f16e55 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Documentation updates.  Just some simple changes, no design-level
   additions.

 - Miscellaneous fixes.

 - Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-30 08:27:41 +02:00
Linus Torvalds 086e3eb65e Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Two weeks worth of fixes here"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
  init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
  autofs: don't get stuck in a loop if vfs_write() returns an error
  mm/page_owner: avoid null pointer dereference
  tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
  fs/nilfs2: fix potential underflow in call to crc32_le
  oom, suspend: fix oom_reaper vs. oom_killer_disable race
  ocfs2: disable BUG assertions in reading blocks
  mm, compaction: abort free scanner if split fails
  mm: prevent KASAN false positives in kmemleak
  mm/hugetlb: clear compound_mapcount when freeing gigantic pages
  mm/swap.c: flush lru pvecs on compound page arrival
  memcg: css_alloc should return an ERR_PTR value on error
  memcg: mem_cgroup_migrate() may be called with irq disabled
  hugetlb: fix nr_pmds accounting with shared page tables
  Revert "mm: disable fault around on emulated access bit architecture"
  Revert "mm: make faultaround produce old ptes"
  mailmap: add Boris Brezillon's email
  mailmap: add Antoine Tenart's email
  mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
  mm: mempool: kasan: don't poot mempool objects in quarantine
  ...
2016-06-24 19:08:33 -07:00
Rasmus Villemoes 0fd5ed8d89 init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
When I replaced kasprintf("%pf") with a direct call to
sprint_symbol_no_offset I must have broken the initcall blacklisting
feature on the arches where dereference_function_descriptor() is
non-trivial.

Fixes: c8cdd2be21 (init/main.c: simplify initcall_blacklisted())
Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Linus Torvalds b235beea9e Clarify naming of thread info/stack allocators
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.

But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.

Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical.  That
identity then meant that we would have things like

	ti = alloc_thread_info_node(tsk, node);
	...
	tsk->stack = ti;

which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.

So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack.  The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.

This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.

The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter.  It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.

Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 15:09:37 -07:00
Geert Uytterhoeven 5e0d8d59a5 init: fix Kconfig text
[jkosina@suse.cz: folded another fix on top on the same line as spotted by
 Randy Dunlap]
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-06-21 13:25:13 +02:00
Paul E. McKenney 570dd3c742 rcu: Disable TASKS_RCU for usermode Linux
Usermode Linux currently does not implement arch_irqs_disabled_flags(),
which results in a build failure in TASKS_RCU.  Therefore, this commit
disables the TASKS_RCU Kconfig option in usermode Linux builds.  The
usermode Linux maintainers expect to merge arch_irqs_disabled_flags()
into 4.8, at which point this commit may be reverted.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Acked-by: Richard Weinberger <richard@nod.at>
2016-06-15 15:32:01 -07:00
Yang Shi fe53ca5427 mm: use early_pfn_to_nid in page_ext_init
page_ext_init() checks suitable pages with pfn_to_nid(), but
pfn_to_nid() depends on memmap which will not be setup fully until
page_alloc_init_late() is done.  Use early_pfn_to_nid() instead of
pfn_to_nid() so that page extension could be still used early even
though CONFIG_ DEFERRED_STRUCT_PAGE_INIT is enabled and catch early page
allocation call sites.

Suggested by Joonsoo Kim [1], this fix basically undoes the change
introduced by commit b8f1a75d61 ("mm: call page_ext_init() after all
struct pages are initialized") and fixes the same problem with a better
approach.

[1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com

Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Linus Torvalds 5b26fc8824 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:

 - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
   unexports symbols which are not used in the current config [Nicolas
   Pitre]

 - several kbuild rule cleanups [Masahiro Yamada]

 - warning option adjustments for gcov etc [Arnd Bergmann]

 - a few more small fixes

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
  kbuild: move -Wunused-const-variable to W=1 warning level
  kbuild: fix if_change and friends to consider argument order
  kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
  kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
  gcov: disable -Wmaybe-uninitialized warning
  gcov: disable tree-loop-im to reduce stack usage
  gcov: disable for COMPILE_TEST
  Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
  Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
  kbuild: forbid kernel directory to contain spaces and colons
  kbuild: adjust ksym_dep_filter for some cmd_* renames
  kbuild: Fix dependencies for final vmlinux link
  kbuild: better abstract vmlinux sequential prerequisites
  kbuild: fix call to adjust_autoksyms.sh when output directory specified
  kbuild: Get rid of KBUILD_STR
  kbuild: rename cmd_as_s_S to cmd_cpp_s_S
  kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
  kbuild: drop redundant "PHONY += FORCE"
  kbuild: delete unnecessary "@:"
  kbuild: mark help target as PHONY
  ...
2016-05-26 22:01:22 -07:00
Rasmus Villemoes c8cdd2be21 init/main.c: simplify initcall_blacklisted()
Using kasprintf to get the function name makes us look up the name
twice, along with all the vsnprintf overhead of parsing the format
string etc.  It also means there is an allocation failure case to deal
with.  Since symbol_string in vsprintf.c would anyway allocate an array
of size KSYM_SYMBOL_LEN on the stack, that might as well be done up
here.

Moreover, since this is a debug feature and the blacklisted_initcalls
list is usually empty, we might as well test that and thus avoid looking
up the symbol name even once in the common case.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Petr Mladek 427934b871 printk/nmi: increase the size of NMI buffer and make it configurable
Testing has shown that the backtrace sometimes does not fit into the 4kB
temporary buffer that is used in NMI context.  The warnings are gone
when I double the temporary buffer size.

This patch doubles the buffer size and makes it configurable.

Note that this problem existed even in the x86-specific implementation
that was added by the commit a9edc88093 ("x86/nmi: Perform a safe NMI
stack trace on all CPUs").  Nobody noticed it because it did not print
any warnings.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jiri Kosina <jkosina@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Petr Mladek 42a0bb3f71 printk/nmi: generic solution for safe printk in NMI
printk() takes some locks and could not be used a safe way in NMI
context.

The chance of a deadlock is real especially when printing stacks from
all CPUs.  This particular problem has been addressed on x86 by the
commit a9edc88093 ("x86/nmi: Perform a safe NMI stack trace on all
CPUs").

The patchset brings two big advantages.  First, it makes the NMI
backtraces safe on all architectures for free.  Second, it makes all NMI
messages almost safe on all architectures (the temporary buffer is
limited.  We still should keep the number of messages in NMI context at
minimum).

Note that there already are several messages printed in NMI context:
WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
handlers.  These are not easy to avoid.

This patch reuses most of the code and makes it generic.  It is useful
for all messages and architectures that support NMI.

The alternative printk_func is set when entering and is reseted when
leaving NMI context.  It queues IRQ work to copy the messages into the
main ring buffer in a safe context.

__printk_nmi_flush() copies all available messages and reset the buffer.
Then we could use a simple cmpxchg operations to get synchronized with
writers.  There is also used a spinlock to get synchronized with other
flushers.

We do not longer use seq_buf because it depends on external lock.  It
would be hard to make all supported operations safe for a lockless use.
It would be confusing and error prone to make only some operations safe.

The code is put into separate printk/nmi.c as suggested by Steven
Rostedt.  It needs a per-CPU buffer and is compiled only on
architectures that call nmi_enter().  This is achieved by the new
HAVE_NMI Kconfig flag.

The are MN10300 and Xtensa architectures.  We need to clean up NMI
handling there first.  Let's do it separately.

The patch is heavily based on the draft from Peter Zijlstra, see

  https://lkml.org/lkml/2015/6/10/327

[arnd@arndb.de: printk-nmi: use %zu format string for size_t]
[akpm@linux-foundation.org: min_t->min - all types are size_t here]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jiri Kosina <jkosina@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Yang Shi b8f1a75d61 mm: call page_ext_init() after all struct pages are initialized
When DEFERRED_STRUCT_PAGE_INIT is enabled, just a subset of memmap at
boot are initialized, then the rest are initialized in parallel by
starting one-off "pgdatinitX" kernel thread for each node X.

If page_ext_init is called before it, some pages will not have valid
extension, this may lead the below kernel oops when booting up kernel:

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in:
  CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
  Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
  task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
  RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
  RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
  RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
  RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
  R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
  R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
  FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
  Call Trace:
    free_hot_cold_page+0x192/0x1d0
    __free_pages+0x5c/0x90
    __free_pages_boot_core+0x11a/0x14e
    deferred_free_range+0x50/0x62
    deferred_init_memmap+0x220/0x3c3
    kthread+0xf8/0x110
    ret_from_fork+0x22/0x40
  Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
  RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
   RSP <ffff88017c087c48>
  CR2: 0000000000000000

Move page_ext_init() after page_alloc_init_late() to make sure page extension
is setup for all pages.

Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Thomas Garnier c7ce4f60ac mm: SLAB freelist randomization
Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
the SLAB freelist.  The list is randomized during initialization of a
new set of pages.  The order on different freelist sizes is pre-computed
at boot for performance.  Each kmem_cache has its own randomized
freelist.  Before pre-computed lists are available freelists are
generated dynamically.  This security feature reduces the predictability
of the kernel SLAB allocator against heap overflows rendering attacks
much less stable.

For example this attack against SLUB (also applicable against SLAB)
would be affected:

  https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/

Also, since v4.6 the freelist was moved at the end of the SLAB.  It
means a controllable heap is opened to new attacks not yet publicly
discussed.  A kernel heap overflow can be transformed to multiple
use-after-free.  This feature makes this type of attack harder too.

To generate entropy, we use get_random_bytes_arch because 0 bits of
entropy is available in the boot stage.  In the worse case this function
will fallback to the get_random_bytes sub API.  We also generate a shift
random number to shift pre-computed freelist for each new set of pages.

The config option name is not specific to the SLAB as this approach will
be extended to other allocators like SLUB.

Performance results highlighted no major changes:

Hackbench (running 90 10 times):

  Before average: 0.0698
  After average: 0.0663 (-5.01%)

slab_test 1 run on boot.  Difference only seen on the 2048 size test
being the worse case scenario covered by freelist randomization.  New
slab pages are constantly being created on the 10000 allocations.
Variance should be mainly due to getting new pages every few
allocations.

Before:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
  10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
  10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
  10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
  10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
  10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
  10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
  10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
  10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
  10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
  10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
  10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
  2. Kmalloc: alloc/free test
  10000 times kmalloc(8)/kfree -> 121 cycles
  10000 times kmalloc(16)/kfree -> 121 cycles
  10000 times kmalloc(32)/kfree -> 121 cycles
  10000 times kmalloc(64)/kfree -> 121 cycles
  10000 times kmalloc(128)/kfree -> 121 cycles
  10000 times kmalloc(256)/kfree -> 119 cycles
  10000 times kmalloc(512)/kfree -> 119 cycles
  10000 times kmalloc(1024)/kfree -> 119 cycles
  10000 times kmalloc(2048)/kfree -> 119 cycles
  10000 times kmalloc(4096)/kfree -> 121 cycles
  10000 times kmalloc(8192)/kfree -> 119 cycles
  10000 times kmalloc(16384)/kfree -> 119 cycles

After:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
  10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
  10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
  10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
  10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
  10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
  10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
  10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
  10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
  10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
  10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
  10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
  2. Kmalloc: alloc/free test
  10000 times kmalloc(8)/kfree -> 121 cycles
  10000 times kmalloc(16)/kfree -> 121 cycles
  10000 times kmalloc(32)/kfree -> 123 cycles
  10000 times kmalloc(64)/kfree -> 142 cycles
  10000 times kmalloc(128)/kfree -> 121 cycles
  10000 times kmalloc(256)/kfree -> 119 cycles
  10000 times kmalloc(512)/kfree -> 119 cycles
  10000 times kmalloc(1024)/kfree -> 119 cycles
  10000 times kmalloc(2048)/kfree -> 119 cycles
  10000 times kmalloc(4096)/kfree -> 119 cycles
  10000 times kmalloc(8192)/kfree -> 119 cycles
  10000 times kmalloc(16384)/kfree -> 119 cycles

[akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Arnd Bergmann 877417e6ff Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
CC_OPTIMIZE_FOR_SIZE disables the often useful -Wmaybe-unused warning,
because that causes a ridiculous amount of false positives when combined
with -Os.

This means a lot of warnings don't show up in testing by the developers
that should see them with an 'allmodconfig' kernel that has
CC_OPTIMIZE_FOR_SIZE enabled, but only later in randconfig builds
that don't.

This changes the Kconfig logic around CC_OPTIMIZE_FOR_SIZE to make
it a 'choice' statement defaulting to CC_OPTIMIZE_FOR_PERFORMANCE
that gets added for this purpose. The allmodconfig and allyesconfig
kernels now default to -O2 with the maybe-unused warning enabled.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-05-10 17:12:48 +02:00
Andi Kleen f76be61755 Make CONFIG_FHANDLE default y
Newer Fedora and OpenSUSE didn't boot with my standard configuration.
It took me some time to figure out why, in fact I had to write a script
to try different config options systematically.

The problem is that something (systemd) in dracut depends on
CONFIG_FHANDLE, which adds open by file handle syscalls.

While it is set in defconfigs it is very easy to miss when updating
older configs because it is not default y.

Make it default y and also depend on EXPERT, as dracut use is likely
widespread.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 17:03:37 -05:00
Nicolas Pitre dbacb0ef67 kconfig option for TRIM_UNUSED_KSYMS
The config option to enable it all.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
2016-03-29 16:30:57 -04:00
Linus Torvalds 6b5f04b6cf Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "cgroup changes for v4.6-rc1.  No userland visible behavior changes in
  this pull request.  I'll send out a separate pull request for the
  addition of cgroup namespace support.

   - The biggest change is the revamping of cgroup core task migration
     and controller handling logic.  There are quite a few places where
     controllers and tasks are manipulated.  Previously, many of those
     places implemented custom operations for each specific use case
     assuming specific starting conditions.  While this worked, it makes
     the code fragile and difficult to follow.

     The bulk of this pull request restructures these operations so that
     most related operations are performed through common helpers which
     implement recursive (subtrees are always processed consistently)
     and idempotent (they make cgroup hierarchy converge to the target
     state rather than performing operations assuming specific starting
     conditions).  This makes the code a lot easier to understand,
     verify and extend.

   - Implicit controller support is added.  This is primarily for using
     perf_event on the v2 hierarchy so that perf can match cgroup v2
     path without requiring the user to do anything special.  The kernel
     portion of perf_event changes is acked but userland changes are
     still pending review.

   - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
     certain environments.

   - There is a regression introduced during v4.4 devel cycle where
     attempts to migrate zombie tasks can mess up internal object
     management.  This was fixed earlier this week and included in this
     pull request w/ stable cc'd.

   - Misc non-critical fixes and improvements"

* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
  cgroup: avoid false positive gcc-6 warning
  cgroup: ignore css_sets associated with dead cgroups during migration
  Documentation: cgroup v2: Trivial heading correction.
  cgroup: implement cgroup_subsys->implicit_on_dfl
  cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
  cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
  cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
  cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
  cgroup: Trivial correction to reflect controller.
  cgroup: remove stale item in cgroup-v1 document INDEX file.
  cgroup: update css iteration in cgroup_update_dfl_csses()
  cgroup: allocate 2x cgrp_cset_links when setting up a new root
  cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
  cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
  cgroup: use cgroup_apply_enable_control() in cgroup creation path
  cgroup: combine cgroup_mutex locking and offline css draining
  cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
  cgroup: introduce cgroup_{save|propagate|restore}_control()
  cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
  cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
  ...
2016-03-18 20:25:49 -07:00
Linus Torvalds bb7aeae3d6 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor
  fixes scattered across the subsystem.

  IMA now requires signed policy, and that policy is also now measured
  and appraised"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits)
  X.509: Make algo identifiers text instead of enum
  akcipher: Move the RSA DER encoding check to the crypto layer
  crypto: Add hash param to pkcs1pad
  sign-file: fix build with CMS support disabled
  MAINTAINERS: update tpmdd urls
  MODSIGN: linux/string.h should be #included to get memcpy()
  certs: Fix misaligned data in extra certificate list
  X.509: Handle midnight alternative notation in GeneralizedTime
  X.509: Support leap seconds
  Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
  X.509: Fix leap year handling again
  PKCS#7: fix unitialized boolean 'want'
  firmware: change kernel read fail to dev_dbg()
  KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert
  KEYS: Reserve an extra certificate symbol for inserting without recompiling
  modsign: hide openssl output in silent builds
  tpm_tis: fix build warning with tpm_tis_resume
  ima: require signed IMA policy
  ima: measure and appraise the IMA policy itself
  ima: load policy using path
  ...
2016-03-17 11:33:45 -07:00
Linus Torvalds 271ecc5253 Merge branch 'akpm' (patches from Andrew)
Merge first patch-bomb from Andrew Morton:

 - some misc things

 - ofs2 updates

 - about half of MM

 - checkpatch updates

 - autofs4 update

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  autofs4: fix string.h include in auto_dev-ioctl.h
  autofs4: use pr_xxx() macros directly for logging
  autofs4: change log print macros to not insert newline
  autofs4: make autofs log prints consistent
  autofs4: fix some white space errors
  autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
  autofs4: fix coding style line length in autofs4_wait()
  autofs4: fix coding style problem in autofs4_get_set_timeout()
  autofs4: coding style fixes
  autofs: show pipe inode in mount options
  kallsyms: add support for relative offsets in kallsyms address table
  kallsyms: don't overload absolute symbol type for percpu symbols
  x86: kallsyms: disable absolute percpu symbols on !SMP
  checkpatch: fix another left brace warning
  checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
  checkpatch: warn on bare unsigned or signed declarations without int
  checkpatch: exclude asm volatile from complex macro check
  mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
  mm: migrate: consolidate mem_cgroup_migrate() calls
  mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
  ...
2016-03-16 11:51:08 -07:00
Ard Biesheuvel 2213e9a66b kallsyms: add support for relative offsets in kallsyms address table
Similar to how relative extables are implemented, it is possible to emit
the kallsyms table in such a way that it contains offsets relative to
some anchor point in the kernel image rather than absolute addresses.

On 64-bit architectures, it cuts the size of the kallsyms address table
in half, since offsets between kernel symbols can typically be expressed
in 32 bits.  This saves several hundreds of kilobytes of permanent
.rodata on average.  In addition, the kallsyms address table is no
longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
effect, so the relocation work done after decompression now doesn't have
to do relocation updates for all these values.  This saves up to 24
bytes (i.e., the size of a ELF64 RELA relocation table entry) per value,
which easily adds up to a couple of megabytes of uncompressed __init
data on ppc64 or arm64.  Even if these relocation entries typically
compress well, the combined size reduction of 2.8 MB uncompressed for a
ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
KB space saving in the compressed image.

Since it is useful for some architectures (like x86) to retain the
ability to emit absolute values as well, this patch also adds support
for capturing both absolute and relative values when
KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
addresses as positive 32-bit values, and addresses relative to the
lowest encountered relative symbol as negative values, which are
subtracted from the runtime address of this base symbol to produce the
actual address.

Support for the above is enabled by default for all architectures except
IA-64 and Tile-GX, whose symbols are too far apart to capture in this
manner.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Ard Biesheuvel 4d5d5664c9 x86: kallsyms: disable absolute percpu symbols on !SMP
scripts/kallsyms.c has a special --absolute-percpu command line option
which deals with the zero based per cpu offsets that are used when
building for SMP on x86_64.  This means that the option should only be
passed in that case, so add a Kconfig symbol with the correct predicate,
and use that instead.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Geliang Tang e6fd1fb3b5 init/main.c: use list_for_each_entry()
Use list_for_each_entry() instead of list_for_each() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Linus Torvalds 710d60cbf1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu hotplug updates from Thomas Gleixner:
 "This is the first part of the ongoing cpu hotplug rework:

   - Initial implementation of the state machine

   - Runs all online and prepare down callbacks on the plugged cpu and
     not on some random processor

   - Replaces busy loop waiting with completions

   - Adds tracepoints so the states can be followed"

More detailed commentary on this work from an earlier email:
 "What's wrong with the current cpu hotplug infrastructure?

   - Asymmetry

     The hotplug notifier mechanism is asymmetric versus the bringup and
     teardown.  This is mostly caused by the notifier mechanism.

   - Largely undocumented dependencies

     While some notifiers use explicitely defined notifier priorities,
     we have quite some notifiers which use numerical priorities to
     express dependencies without any documentation why.

   - Control processor driven

     Most of the bringup/teardown of a cpu is driven by a control
     processor.  While it is understandable, that preperatory steps,
     like idle thread creation, memory allocation for and initialization
     of essential facilities needs to be done before a cpu can boot,
     there is no reason why everything else must run on a control
     processor.  Before this patch series, bringup looks like this:

       Control CPU                     Booting CPU

       do preparatory steps
       kick cpu into life

                                       do low level init

       sync with booting cpu           sync with control cpu

       bring the rest up

   - All or nothing approach

     There is no way to do partial bringups.  That's something which is
     really desired because we waste e.g.  at boot substantial amount of
     time just busy waiting that the cpu comes to life.  That's stupid
     as we could very well do preparatory steps and the initial IPI for
     other cpus and then go back and do the necessary low level
     synchronization with the freshly booted cpu.

   - Minimal debuggability

     Due to the notifier based design, it's impossible to switch between
     two stages of the bringup/teardown back and forth in order to test
     the correctness.  So in many hotplug notifiers the cancel
     mechanisms are either not existant or completely untested.

   - Notifier [un]registering is tedious

     To [un]register notifiers we need to protect against hotplug at
     every callsite.  There is no mechanism that bringup/teardown
     callbacks are issued on the online cpus, so every caller needs to
     do it itself.  That also includes error rollback.

  What's the new design?

     The base of the new design is a symmetric state machine, where both
     the control processor and the booting/dying cpu execute a well
     defined set of states.  Each state is symmetric in the end, except
     for some well defined exceptions, and the bringup/teardown can be
     stopped and reversed at almost all states.

     So the bringup of a cpu will look like this in the future:

       Control CPU                     Booting CPU

       do preparatory steps
       kick cpu into life

                                       do low level init

       sync with booting cpu           sync with control cpu

                                       bring itself up

     The synchronization step does not require the control cpu to wait.
     That mechanism can be done asynchronously via a worker or some
     other mechanism.

     The teardown can be made very similar, so that the dying cpu cleans
     up and brings itself down.  Cleanups which need to be done after
     the cpu is gone, can be scheduled asynchronously as well.

  There is a long way to this, as we need to refactor the notion when a
  cpu is available.  Today we set the cpu online right after it comes
  out of the low level bringup, which is not really correct.

  The proper mechanism is to set it to available, i.e. cpu local
  threads, like softirqd, hotplug thread etc. can be scheduled on that
  cpu, and once it finished all booting steps, it's set to online, so
  general workloads can be scheduled on it.  The reverse happens on
  teardown.  First thing to do is to forbid scheduling of general
  workloads, then teardown all the per cpu resources and finally shut it
  off completely.

  This patch series implements the basic infrastructure for this at the
  core level.  This includes the following:

   - Basic state machine implementation with well defined states, so
     ordering and prioritization can be expressed.

   - Interfaces to [un]register state callbacks

     This invokes the bringup/teardown callback on all online cpus with
     the proper protection in place and [un]installs the callbacks in
     the state machine array.

     For callbacks which have no particular ordering requirement we have
     a dynamic state space, so that drivers don't have to register an
     explicit hotplug state.

     If a callback fails, the code automatically does a rollback to the
     previous state.

   - Sysfs interface to drive the state machine to a particular step.

     This is only partially functional today.  Full functionality and
     therefor testability will be achieved once we converted all
     existing hotplug notifiers over to the new scheme.

   - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
     processor:

       Control CPU                     Booting CPU

       do preparatory steps
       kick cpu into life

                                       do low level init

       sync with booting cpu           sync with control cpu
       wait for boot
                                       bring itself up

                                       Signal completion to control cpu

     In a previous step of this work we've done a full tree mechanical
     conversion of all hotplug notifiers to the new scheme.  The balance
     is a net removal of about 4000 lines of code.

     This is not included in this series, as we decided to take a
     different approach.  Instead of mechanically converting everything
     over, we will do a proper overhaul of the usage sites one by one so
     they nicely fit into the symmetric callback scheme.

     I decided to do that after I looked at the ugliness of some of the
     converted sites and figured out that their hotplug mechanism is
     completely buggered anyway.  So there is no point to do a
     mechanical conversion first as we need to go through the usage
     sites one by one again in order to achieve a full symmetric and
     testable behaviour"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  cpu/hotplug: Document states better
  cpu/hotplug: Fix smpboot thread ordering
  cpu/hotplug: Remove redundant state check
  cpu/hotplug: Plug death reporting race
  rcu: Make CPU_DYING_IDLE an explicit call
  cpu/hotplug: Make wait for dead cpu completion based
  cpu/hotplug: Let upcoming cpu bring itself fully up
  arch/hotplug: Call into idle with a proper state
  cpu/hotplug: Move online calls to hotplugged cpu
  cpu/hotplug: Create hotplug threads
  cpu/hotplug: Split out the state walk into functions
  cpu/hotplug: Unpark smpboot threads from the state machine
  cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
  cpu/hotplug: Implement setup/removal interface
  cpu/hotplug: Make target state writeable
  cpu/hotplug: Add sysfs state interface
  cpu/hotplug: Hand in target state to _cpu_up/down
  cpu/hotplug: Convert the hotplugged cpu work to a state machine
  cpu/hotplug: Convert to a state machine for the control processor
  cpu/hotplug: Add tracepoints
  ...
2016-03-15 13:50:29 -07:00
Linus Torvalds d09e356ad0 Merge branch 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull read-only kernel memory updates from Ingo Molnar:
 "This tree adds two (security related) enhancements to the kernel's
  handling of read-only kernel memory:

   - extend read-only kernel memory to a new class of formerly writable
     kernel data: 'post-init read-only memory' via the __ro_after_init
     attribute, and mark the ARM and x86 vDSO as such read-only memory.

     This kind of attribute can be used for data that requires a once
     per bootup initialization sequence, but is otherwise never modified
     after that point.

     This feature was based on the work by PaX Team and Brad Spengler.

     (by Kees Cook, the ARM vDSO bits by David Brown.)

   - make CONFIG_DEBUG_RODATA always enabled on x86 and remove the
     Kconfig option.  This simplifies the kernel and also signals that
     read-only memory is the default model and a first-class citizen.
     (Kees Cook)"

* 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ARM/vdso: Mark the vDSO code read-only after init
  x86/vdso: Mark the vDSO code read-only after init
  lkdtm: Verify that '__ro_after_init' works correctly
  arch: Introduce post-init read-only memory
  x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
  mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
  asm-generic: Consolidate mark_rodata_ro()
2016-03-14 16:58:50 -07:00
Parav Pandit 6cc578df40 cgroup: Trivial correction to reflect controller.
Trivial correction in menuconfig help to reflect PIDs as
controller instead of subsystem to align to rest of the text
and documentation.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-05 07:48:01 -05:00
David Howells d43de6c780 akcipher: Move the RSA DER encoding check to the crypto layer
Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key public_key
subtype to the rsa crypto module's pkcs1pad template.  This means that the
public_key subtype no longer has any dependencies on public key type.

To make this work, the following changes have been made:

 (1) The rsa pkcs1pad template is now used for RSA keys.  This strips off the
     padding and returns just the message hash.

 (2) In a previous patch, the pkcs1pad template gained an optional second
     parameter that, if given, specifies the hash used.  We now give this,
     and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5
     encoding and verifies that the correct digest OID is present.

 (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced to
     something that doesn't care about what the encryption actually does
     and and has been merged into public_key.c.

 (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone.  Module signing must set
     CONFIG_CRYPTO_RSA=y instead.

Thoughts:

 (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to
     the padding template?  Should there be multiple padding templates
     registered that share most of the code?

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-03-03 21:49:27 +00:00
Thomas Gleixner 931ef16330 cpu/hotplug: Unpark smpboot threads from the state machine
Handle the smpboot threads in the state machine.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.295777684@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:56 +01:00
Thomas Gleixner cff7d378d3 cpu/hotplug: Convert to a state machine for the control processor
Move the split out steps into a callback array and let the cpu_up/down
code iterate through the array functions. For now most of the
callbacks are asymmetric to resemble the current hotplug maze.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.671816690@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:54 +01:00
Kees Cook d2aa1acad2 mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
It may be useful to debug writes to the readonly sections of memory,
so provide a cmdline "rodata=off" to allow for this. This can be
expanded in the future to support "log" and "write" modes, but that
will need to be architecture-specific.

This also makes KDB software breakpoints more usable, as read-only
mappings can now be disabled on any kernel.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1455748879-21872-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-22 08:51:37 +01:00
Andrey Ryabinin 06bea3dbfe locking/lockdep: Eliminate lockdep_init()
Lockdep is initialized at compile time now.  Get rid of lockdep_init().

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Krinkin <krinkin.m.u@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: mm-commits@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-09 12:03:25 +01:00
Johannes Weiner d886f4e483 mm: memcontrol: rein in the CONFIG space madness
What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
controller code is insignificant, having these conditionals is not
worth the complication and fragility that comes with them.

[akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Johannes Weiner 489c2a20a4 mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM
Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
interface. This also makes legacy-only code sections stand out better.

[arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Yaowei Bai f057f3b226 init/do_mounts: initrd_load() can be boolean
Make initrd_load() return bool due to this particular function only using
either one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Yaowei Bai 31c025b5fe init/main.c: obsolete_checksetup can be boolean
Make obsolete_checksetup() return bool due to this particular function
only using either one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Linus Torvalds 2d663b5581 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Seven audit patches for 4.5, all very minor despite the diffstat.

  The diffstat churn for linux/audit.h can be attributed to needing to
  reshuffle the linux/audit.h header to fix the seccomp auditing issue
  (see the commit description for details).

  Besides the seccomp/audit fix, most of the fixes are around trying to
  improve the connection with the audit daemon and a Kconfig
  simplification.  Nothing crazy, and everything passes our little
  audit-testsuite"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: always enable syscall auditing when supported and audit is enabled
  audit: force seccomp event logging to honor the audit_enabled flag
  audit: Delete unnecessary checks before two function calls
  audit: wake up threads if queue switched from limited to unlimited
  audit: include auditd's threads in audit_log_start() wait exception
  audit: remove audit_backlog_wait_overflow
  audit: don't needlessly reset valid wait time
2016-01-17 18:48:49 -08:00
Linus Torvalds 0cbeafb245 Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync).  This isn't quite complete but DAX
      is new and it's good enough and the guys have a handle on what
      needs to be done - I expect this to be wrapped in the next week or
      two.

  - some vsprintf maintenance work

  - various other misc bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (145 commits)
  printk: change recursion_bug type to bool
  lib/vsprintf: factor out %pN[F] handler as netdev_bits()
  lib/vsprintf: refactor duplicate code to special_hex_number()
  printk-formats.txt: remove unimplemented %pT
  printk: help pr_debug and pr_devel to optimize out arguments
  lib/test_printf.c: test dentry printing
  lib/test_printf.c: add test for large bitmaps
  lib/test_printf.c: account for kvasprintf tests
  lib/test_printf.c: add a few number() tests
  lib/test_printf.c: test precision quirks
  lib/test_printf.c: check for out-of-bound writes
  lib/test_printf.c: don't BUG
  lib/kasprintf.c: add sanity check to kvasprintf
  lib/vsprintf.c: warn about too large precisions and field widths
  lib/vsprintf.c: help gcc make number() smaller
  lib/vsprintf.c: expand field_width to 24 bits
  lib/vsprintf.c: eliminate potential race in string()
  lib/vsprintf.c: move string() below widen_string()
  lib/vsprintf.c: pull out padding code from dentry_name()
  printk: do cond_resched() between lines while outputting to consoles
  ...
2016-01-17 12:58:52 -08:00
Linus Torvalds e535d74bc5 A relatively boring cycle in the docs tree. There's a few kernel-doc
fixes and various document tweaks.
 
 One patch reaches out of the documentation subtree to fix a comment in
 init/do_mounts_rd.c.  There didn't seem to be anybody more appropriate to
 take that one, so I accepted it.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmmJ8AAoJEI3ONVYwIuV6uqwP/0mnqdxVWo47ohaYJP7q0Soh
 ovJAbfttxKnkmOdGbWcNIJtTiw+MpdF805CYR+2treE0zvEEDodg7BhkDnmKZJ9n
 F1r53JrIj769E1c5ETmWTHcBt3jjtKyQIbBmDr4YTgX91dlKF28o1bMmyDECWIcT
 PktTlPUidDtffKMn3klh6baPCMrTpLJ8aLshBzUrQhrQY8lxcZKAU+98vtFzYofG
 LXCSulMYXumb7XBxErTLQZhmJslD4gaDMh2xkov6ALS8XNHnfoUIFRbArAllNfTf
 LQGJ6Q5qnn58UWi9F/vgDqx7+d1KIPUjBxJR9wfa0w9ggQhA9ly2BSN/fllbiSbp
 yIi1JS4hwBe8H/h577BNC3xjmgVN7mazZsXlS+fg3G16gpv4JdWeRY4efjosFIzQ
 EIJxB8qAovUNqw4s1mzRIJ5B9L7PEK27O6z8N27Fiw4EigtMTFAOC2/GD3ELx4iJ
 p1doiSr+wjfDcFd8kdIUiDKGrTSTXwNy3hUfrhzQyaEjDTJnx3+1+ono1orSazPO
 Fr2RSsC5VzX4IYSuxTMvFSKjN1Iiu8xqwq3IdclHXrBhRvwOF2wpjjQ5Guf0lHBJ
 FLBahSjZqt01kmwFykxoHps+VeSwpoEen6rClBQolfmtYVDTvgRNN46AxK9jZ8T4
 jZmCNNs/mYzrqo/RTnmw
 =u38W
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.5' of git://git.lwn.net/linux

Pull documentation updates from Jon Corbet:
 "A relatively boring cycle in the docs tree.  There's a few kernel-doc
  fixes and various document tweaks.

  One patch reaches out of the documentation subtree to fix a comment in
  init/do_mounts_rd.c.  There didn't seem to be anybody more appropriate
  to take that one, so I accepted it"

* tag 'docs-4.5' of git://git.lwn.net/linux: (29 commits)
  thermal: add description for integral_cutoff unit
  Documentation: update libhugetlbfs site url
  Documentation: Explain pci=conf1,conf2 more verbosely
  DMA-API: fix confusing sentence in Documentation/DMA-API.txt
  Documentation: translations: update linux cross reference link
  Documentation: fix typo in CodingStyle
  init, Documentation: Remove ramdisk_blocksize mentions
  Documentation-getdelays: Apply a recommendation from "checkpatch.pl" in main()
  Documentation: HOWTO: update versions from 3.x to 4.x
  Documentation: remove outdated references from translations
  Doc: treewide: Fix grammar "a" to "an"
  Documentation: cpu-hotplug: Fix sysfs mount instructions
  can-doc: Add hint about getting timestamps
  Fix CFQ I/O scheduler parameter name in documentation
  Documentation: arm: remove dead links from Marvell Berlin docs
  Documentation: HOWTO: update code cross reference link
  Doc: Docbook/iio: Fix typo in iio.tmpl
  DocBook: make index.html generation less verbose by default
  DocBook: Cleanup: remove an unused $(call) line
  DocBook: Add a help message for DOCBOOKS env var
  ...
2016-01-17 11:55:07 -08:00
Riku Voipio b2113a417c uselib: default depending if libc5 was used
uselib hasn't been used since libc5; glibc does not use it.  Deprecate
uselib a bit more, by making the default y only if libc5 was widely used
on the plaform.

This makes arm64 kernel built with defconfig slightly smaller

bloat-o-meter:
  add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
  function                                     old     new   delta
  kernel_config_data                         18164   18162      -2
  uselib_flags                                  20       -     -20
  padzero                                      216     192     -24
  sys_uselib                                   380       -    -380
  load_elf_library                             964       -    -964

Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:24 -08:00
Paul Moore cb74ed278f audit: always enable syscall auditing when supported and audit is enabled
To the best of our knowledge, everyone who enables audit at compile
time also enables syscall auditing; this patch simplifies the Kconfig
menus by removing the option to disable syscall auditing when audit
is selected and the target arch supports it.

Signed-off-by: Paul Moore <pmoore@redhat.com>
2016-01-13 09:18:55 -05:00
Linus Torvalds 34a9304a96 Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cgroup v2 interface is now official.  It's no longer hidden behind a
   devel flag and can be mounted using the new cgroup2 fs type.

   Unfortunately, cpu v2 interface hasn't made it yet due to the
   discussion around in-process hierarchical resource distribution and
   only memory and io controllers can be used on the v2 interface at the
   moment.

 - The existing documentation which has always been a bit of mess is
   relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
   is added as the authoritative documentation for the v2 interface.

 - Some features are added through for-4.5-ancestor-test branch to
   enable netfilter xt_cgroup match to use cgroup v2 paths.  The actual
   netfilter changes will be merged through the net tree which pulled in
   the said branch.

 - Various cleanups

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rename cgroup documentations
  cgroup: fix a typo.
  cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
  cgroup: demote subsystem init messages to KERN_DEBUG
  cgroup: Fix uninitialized variable warning
  cgroup: put controller Kconfig options in meaningful order
  cgroup: clean up the kernel configuration menu nomenclature
  cgroup_pids: fix a typo.
  Subject: cgroup: Fix incomplete dd command in blkio documentation
  cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
  cpuset: Replace all instances of time_t with time64_t
  cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
  cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
  cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
2016-01-12 19:20:32 -08:00
Ingo Molnar 3104fb3dd4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Adding transitivity uniformly to rcu_node structure ->lock
   acquisitions.  (This is implemented by the first two commits
   on top of v4.4-rc2 due to the pervasive nature of this change.)

 - Documentation updates, including RCU requirements.

 - Expedited grace-period changes.

 - Miscellaneous fixes.

 - Linked-list fixes, courtesy of KTSAN.

 - Torture-test updates.

 - Late-breaking fix to sysrq-generated crash.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:41:48 +01:00
Robert Elliott b03539665d init, Documentation: Remove ramdisk_blocksize mentions
The brd driver has never supported the ramdisk_blocksize kernel
parameter that was in the rd driver it replaced, so remove
mention of this parameter from comments and Documentation.

Commit 9db5579be4 ("rewrite rd") replaced rd with brd, keeping
a brd_blocksize variable in struct brd_device but never using it.

Commit a2cba2913c ("brd: get rid of unused members from struct
brd_device") removed the unused variable.

Commit f5abc8e758 ("Documentation/blockdev/ramdisk.txt: updates")
removed mentions of ramdisk_blocksize from that file.

Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-12-26 05:22:00 -07:00
Johannes Weiner 6bf024e693 cgroup: put controller Kconfig options in meaningful order
To make it easier to quickly find what's needed list the basic
resource controllers of cgroup2 first - io, memory, cpu - while
pushing the more exotic and/or legacy controllers to the bottom.

tj: Removed spurious "&& CGROUPS" from CGROUP_PERF as suggested by Li.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-18 12:43:15 -05:00
Johannes Weiner a0166ec4b0 cgroup: clean up the kernel configuration menu nomenclature
The config options for the different cgroup controllers use various
terms: resource controller, cgroup subsystem, etc. Simplify this to
"controller", which is clear enough in the cgroup context.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-18 12:43:15 -05:00
Chris Wilson 86fffe4a61 kernel: remove stop_machine() Kconfig dependency
Currently the full stop_machine() routine is only enabled on SMP if
module unloading is enabled, or if the CPUs are hotpluggable.  This
leads to configurations where stop_machine() is broken as it will then
only run the callback on the local CPU with irqs disabled, and not stop
the other CPUs or run the callback on them.

For example, this breaks MTRR setup on x86 in certain configs since
ea8596bb2d ("kprobes/x86: Remove unused text_poke_smp() and
text_poke_smp_batch() functions") as the MTRR is only established on the
boot CPU.

This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
and HOTPLUG_CPU config options to compile the correct stop_machine() for
the architecture, removing the false dependency on MODULE_UNLOAD in the
process.

Link: https://lkml.org/lkml/2014/10/8/124
References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12 10:15:34 -08:00
Paul E. McKenney 967dcb8fe6 rcu: Wire up rcu_end_inkernel_boot()
This commit adds the invocation of rcu_end_inkernel_boot() just before
init is invoked.  This allows the CONFIG_RCU_EXPEDITE_BOOT Kconfig
option to do something useful and prepares for the upcoming
rcupdate.rcu_normal_after_boot kernel parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:54 -08:00
Mathieu Desnoyers 5b25b13ab0 sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system.  It is
implemented by calling synchronize_sched().  It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier.  For synchronization primitives
that distinguish between read-side and write-side (e.g.  userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.

The existing applications of which I am aware that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)
  - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
  - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking.  Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
  - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux.  They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query  the  set  of  supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on  the  system.
              Upon  return from system call, the caller thread is ensured that
              all running threads have passed through a state where all memory
              accesses  to  user-space  addresses  match program order between
              entry to and return from the system  call  (non-running  threads
              are de facto in such a state). This covers threads from all pro=E2=80=90
              cesses running on the system.  This command returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed  in  program  order  from  each  targeted
       thread is guaranteed to be ordered with respect to sys_membarrier(). If
       we use the semantic "barrier()" to represent a compiler barrier forcing
       memory  accesses  to  be performed in program order across the barrier,
       and smp_mb() to represent explicit memory barriers forcing full  memory
       ordering  across  the barrier, we have the following ordering table for
       each pair of barrier(), sys_membarrier() and smp_mb():

       The pair ordering is detailed as (O: ordered, X: not ordered):

                              barrier()   smp_mb() sys_membarrier()
              barrier()          X           X            O
              smp_mb()           X           O            O
              sys_membarrier()   O           O            O

RETURN VALUE
       On success, these system calls return zero.  On error, -1 is  returned,
       and errno is set appropriately. For a given command, with flags
       argument set to 0, this system call is guaranteed to always return the
       same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                             2015-04-15                     MEMBARRIER(2)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 15:21:34 -07:00
Dave Young 2965faa5e0 kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load another and kexec_file_load.
 kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Frederic Weisbecker 90f023030e kmod: use system_unbound_wq instead of khelper
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper.  This workqueue has
unbound properties and thus a wide affinity inherited by all its children.

Now khelper also has special properties that we aren't much interested in:
ordered and singlethread.  There is really no need about ordering as all
we do is creating kernel threads.  This can be done concurrently.  And
singlethread is a useless limitation as well.

The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.

The only worrysome specific is their affinity to the node of the current
CPU.  It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.

This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Linus Torvalds b793c005ce Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

   - PKCS#7 support added to support signed kexec, also utilized for
     module signing.  See comments in 3f1e1bea.

     ** NOTE: this requires linking against the OpenSSL library, which
        must be installed, e.g.  the openssl-devel on Fedora **

   - Smack
      - add IPv6 host labeling; ignore labels on kernel threads
      - support smack labeling mounts which use binary mount data

   - SELinux:
      - add ioctl whitelisting (see
        http://kernsec.org/files/lss2015/vanderstoep.pdf)
      - fix mprotect PROT_EXEC regression caused by mm change

   - Seccomp:
      - add ptrace options for suspend/resume"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
  PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
  Documentation/Changes: Now need OpenSSL devel packages for module signing
  scripts: add extract-cert and sign-file to .gitignore
  modsign: Handle signing key in source tree
  modsign: Use if_changed rule for extracting cert from module signing key
  Move certificate handling to its own directory
  sign-file: Fix warning about BIO_reset() return value
  PKCS#7: Add MODULE_LICENSE() to test module
  Smack - Fix build error with bringup unconfigured
  sign-file: Document dependency on OpenSSL devel libraries
  PKCS#7: Appropriately restrict authenticated attributes and content type
  KEYS: Add a name for PKEY_ID_PKCS7
  PKCS#7: Improve and export the X.509 ASN.1 time object decoder
  modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
  extract-cert: Cope with multiple X.509 certificates in a single file
  sign-file: Generate CMS message as signature instead of PKCS#7
  PKCS#7: Support CMS messages also [RFC5652]
  X.509: Change recorded SKID & AKID to not include Subject or Issuer
  PKCS#7: Check content type and versions
  MAINTAINERS: The keyrings mailing list has moved
  ...
2015-09-08 12:41:25 -07:00
Linus Torvalds 7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
Linus Torvalds 6c0f568e84 Merge branch 'akpm' (patches from Andrew)
Merge patch-bomb from Andrew Morton:

 - a few misc things

 - Andy's "ambient capabilities"

 - fs/nofity updates

 - the ocfs2 queue

 - kernel/watchdog.c updates and feature work.

 - some of MM.  Includes Andrea's userfaultfd feature.

[ Hadn't noticed that userfaultfd was 'default y' when applying the
  patches, so that got fixed in this merge instead.  We do _not_ mark
  new features that nobody uses yet 'default y'   - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  mm/hugetlb.c: make vma_has_reserves() return bool
  mm/madvise.c: make madvise_behaviour_valid() return bool
  mm/memory.c: make tlb_next_batch() return bool
  mm/dmapool.c: change is_page_busy() return from int to bool
  mm: remove struct node_active_region
  mremap: simplify the "overlap" check in mremap_to()
  mremap: don't do uneccesary checks if new_len == old_len
  mremap: don't do mm_populate(new_addr) on failure
  mm: move ->mremap() from file_operations to vm_operations_struct
  mremap: don't leak new_vma if f_op->mremap() fails
  mm/hugetlb.c: make vma_shareable() return bool
  mm: make GUP handle pfn mapping unless FOLL_GET is requested
  mm: fix status code which move_pages() returns for zero page
  mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
  genalloc: add support of multiple gen_pools per device
  genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
  mm/memblock: WARN_ON when nid differs from overlap region
  Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
  mm: defer flush of writable TLB entries
  mm: send one IPI per CPU to TLB flush all entries after unmapping pages
  ...
2015-09-05 14:27:38 -07:00
Mel Gorman 72b252aed5 mm: send one IPI per CPU to TLB flush all entries after unmapping pages
An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accesssed by other CPUs.  There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate
CPUs.

On small machines, this is not a significant problem but as machine gets
larger with more cores and more memory, the cost of these IPIs can be
high.  This patch uses a simple structure that tracks CPUs that
potentially have TLB entries for pages being unmapped.  When the unmapping
is complete, the full TLB is flushed on the assumption that a refill cost
is lower than flushing individual entries.

Architectures wishing to do this must give the following guarantee.

        If a clean page is unmapped and not immediately flushed, the
        architecture must guarantee that a write to that linear address
        from a CPU with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.  The
architecture should consider whether the cost of the full TLB flush is
higher than sending an IPI to flush each individual entry.  An additional
architecture helper called flush_tlb_local is required.  It's a trivial
wrapper with some accounting in the x86 case.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure.  The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite.  The test case uses NR_CPU
readers of mapped files that consume 10*RAM.

Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                           4.2.0-rc1          4.2.0-rc1
                                             vanilla       flushfull-v7
Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)

           4.2.0-rc1    4.2.0-rc1
             vanilla flushfull-v7
User          581.00       611.43
System       5804.93      4111.76
Elapsed       161.03       122.12

This is showing that the readers completed 24.40% faster with 29% less
system CPU time.  From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test and the patched kernel was interrupts 180K times per second.

The impact is lower on a single socket machine.

                                           4.2.0-rc1          4.2.0-rc1
                                             vanilla       flushfull-v7
Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)

           4.2.0-rc1    4.2.0-rc1
             vanilla flushfull-v7
User           58.09        57.64
System        111.82        76.56
Elapsed        27.29        22.55

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or have
relatively few mapped pages.  It will have an unpredictable impact on the
workload running on the CPU being flushed as it'll depend on how many TLB
entries need to be refilled and how long that takes.  Worst case, the TLB
will be completely cleared of active entries when the target PFNs were not
resident at all.

[sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Andrea Arcangeli a14c151e56 userfaultfd: buildsystem activation
This allows to select the userfaultfd during configuration to build it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Linus Torvalds 8bdc69b764 Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - a new PIDs controller is added.  It turns out that PIDs are actually
   an independent resource from kmem due to the limited PID space.

 - more core preparations for the v2 interface.  Once cpu side interface
   is settled, it should be ready for lifting the devel mask.
   for-4.3-unified-base was temporarily branched so that other trees
   (block) can pull cgroup core changes that blkcg changes depend on.

 - a non-critical idr_preload usage bug fix.

* 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: pids: fix invalid get/put usage
  cgroup: introduce cgroup_subsys->legacy_name
  cgroup: don't print subsystems for the default hierarchy
  cgroup: make cftype->private a unsigned long
  cgroup: export cgrp_dfl_root
  cgroup: define controller file conventions
  cgroup: fix idr_preload usage
  cgroup: add documentation for the PIDs controller
  cgroup: implement the PIDs subsystem
  cgroup: allow a cgroup subsystem to reject a fork
2015-09-02 08:04:23 -07:00
Oleg Nesterov bf3eac84c4 percpu-rwsem: kill CONFIG_PERCPU_RWSEM
Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional
user of percpu_rw_semaphore.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2015-08-15 13:52:11 +02:00
David Howells cfc411e7ff Move certificate handling to its own directory
Move certificate handling out of the kernel/ directory and into a certs/
directory to get all the weird stuff in one place and move the generated
signing keys into this directory.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-14 16:06:13 +01:00
David Howells 228c37ff98 sign-file: Document dependency on OpenSSL devel libraries
The revised sign-file program is no longer a script that wraps the openssl
program, but now rather a program that makes use of OpenSSL's crypto
library.  This means that to build the sign-file program, the kernel build
process now has a dependency on the OpenSSL development packages in
addition to OpenSSL itself.

Document this in Kconfig and in module-signing.txt.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-12 17:01:01 +01:00
Ingo Molnar 9b9412dc70 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

  - The combination of tree geometry-initialization simplifications
    and OS-jitter-reduction changes to expedited grace periods.
    These two are stacked due to the large number of conflicts
    that would otherwise result.

    [ With one addition, a temporary commit to silence a lockdep false
      positive. Additional changes to the expedited grace-period
      primitives (queued for 4.4) remove the cause of this false
      positive, and therefore include a revert of this temporary commit. ]

  - Documentation updates.

  - Torture-test updates.

  - Miscellaneous fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:12:12 +02:00
David Woodhouse 99d27b1b52 modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS option
Let the user explicitly provide a file containing trusted keys, instead of
just automatically finding files matching *.x509 in the build tree and
trusting whatever we find. This really ought to be an *explicit*
configuration, and the build rules for dealing with the files were
fairly painful too.

Fix applied from James Morris that removes an '=' from a macro definition
in kernel/Makefile as this is a feature that only exists from GNU make 3.82
onwards.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse fb11794991 modsign: Use single PEM file for autogenerated key
The current rule for generating signing_key.priv and signing_key.x509 is
a classic example of a bad rule which has a tendency to break parallel
make. When invoked to create *either* target, it generates the other
target as a side-effect that make didn't predict.

So let's switch to using a single file signing_key.pem which contains
both key and certificate. That matches what we do in the case of an
external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
slightly cleaner.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse 1329e8cc69 modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if needed
Where an external PEM file or PKCS#11 URI is given, we can get the cert
from it for ourselves instead of making the user drop signing_key.x509
in place for us.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse 19e91b69d7 modsign: Allow external signing key to be specified
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Howells 091f6e26eb MODSIGN: Extract the blob PKCS#7 signature verifier from module signing
Extract the function that drives the PKCS#7 signature verification given a
data blob and a PKCS#7 blob out from the module signing code and lump it with
the system keyring code as it's generic.  This makes it independent of module
config options and opens it to use by the firmware loader.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Kyle McMartin <kyle@kernel.org>
2015-08-07 16:26:13 +01:00
David Howells 3f1e1bea34 MODSIGN: Use PKCS#7 messages as module signatures
Move to using PKCS#7 messages as module signatures because:

 (1) We have to be able to support the use of X.509 certificates that don't
     have a subjKeyId set.  We're currently relying on this to look up the
     X.509 certificate in the trusted keyring list.

 (2) PKCS#7 message signed information blocks have a field that supplies the
     data required to match with the X.509 certificate that signed it.

 (3) The PKCS#7 certificate carries fields that specify the digest algorithm
     used to generate the signature in a standardised way and the X.509
     certificates specify the public key algorithm in a standardised way - so
     we don't need our own methods of specifying these.

 (4) We now have PKCS#7 message support in the kernel for signed kexec purposes
     and we can make use of this.

To make this work, the old sign-file script has been replaced with a program
that needs compiling in a previous patch.  The rules to build it are added
here.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
2015-08-07 16:26:13 +01:00
Mel Gorman 4248b0da46 fs, file table: reinit files_stat.max_files after deferred memory initialisation
Dave Hansen reported the following;

	My laptop has been behaving strangely with 4.2-rc2.  Once I log
	in to my X session, I start getting all kinds of strange errors
	from applications and see this in my dmesg:

        	VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using.  This
patch recalculates file-max after deferred memory initialisation.  Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1:             files_stat.max_files = 6582781
4.2-rc2:         files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:40 +03:00
Paul E. McKenney be55fa2ad2 rcu: Hide RCU_NOCB_CPU behind RCU_EXPERT
This commit prevents Kconfig from asking the user about RCU_NOCB_CPU
unless the user really wants to be asked.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
2015-07-22 15:27:25 -07:00
Aleksa Sarai 49b786ea14 cgroup: implement the PIDs subsystem
Adds a new single-purpose PIDs subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.

However, it should be noted that organisational operations (adding and
removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking. This is due to the fact that, in the unified hierarchy, attach
cannot fail (and it is not possible for a task to overcome its PIDs
cgroup policy limit by attaching to a child cgroup -- even if migrating
mid-fork it must be able to fork in the parent first).

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-07-14 17:29:23 -04:00
Paul E. McKenney d1ec4c34c7 rcu: Drop RCU_USER_QS in favor of NO_HZ_FULL
The RCU_USER_QS Kconfig parameter is now just a synonym for NO_HZ_FULL,
so this commit eliminates RCU_USER_QS, replacing all uses with NO_HZ_FULL.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
2015-07-06 13:52:18 -07:00
Linus Torvalds 22a093b2fb Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Debug info and other statistics fixes and related enhancements"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix numa balancing stats in /proc/pid/sched
  sched/numa: Show numa_group ID in /proc/sched_debug task listings
  sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
  sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
  sched/stat: Simplify the sched_info accounting dependency
2015-07-04 08:56:53 -07:00
Linus Torvalds 91cca0f0ff Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull max log buf size increase from Ingo Molnar:
 "Ran into this limit recently, so increase it by an order of magnitude"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  printk: Increase maximum CONFIG_LOG_BUF_SHIFT from 21 to 25
2015-07-04 08:16:41 -07:00
Naveen N. Rao f6db834799 sched/stat: Simplify the sched_info accounting dependency
Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
sched_info, which results in ugly #if clauses.

Simplify the code by introducing a synthethic CONFIG_SCHED_INFO
switch, selected by both.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: a.p.zijlstra@chello.nl
Cc: ricklind@us.ibm.com
Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04 10:04:30 +02:00
Linus Torvalds 2d01eedf1d Merge branch 'akpm' (patches from Andrew)
Merge third patchbomb from Andrew Morton:

 - the rest of MM

 - scripts/gdb updates

 - ipc/ updates

 - lib/ updates

 - MAINTAINERS updates

 - various other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (67 commits)
  genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
  genalloc: rename dev_get_gen_pool() to gen_pool_get()
  x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  MAINTAINERS: add zpool
  MAINTAINERS: BCACHE: Kent Overstreet has changed email address
  MAINTAINERS: move Jens Osterkamp to CREDITS
  MAINTAINERS: remove unused nbd.h pattern
  MAINTAINERS: update brcm gpio filename pattern
  MAINTAINERS: update brcm dts pattern
  MAINTAINERS: update sound soc intel patterns
  MAINTAINERS: remove website for paride
  MAINTAINERS: update Emulex ocrdma email addresses
  bcache: use kvfree() in various places
  libcxgbi: use kvfree() in cxgbi_free_big_mem()
  target: use kvfree() in session alloc and free
  IB/ehca: use kvfree() in ipz_queue_{cd}tor()
  drm/nouveau/gem: use kvfree() in u_free()
  drm: use kvfree() in drm_free_large()
  cxgb4: use kvfree() in t4_free_mem()
  cxgb3: use kvfree() in cxgb_free_mem()
  ...
2015-07-01 17:47:51 -07:00
Linus Torvalds 02201e3f1b Minor merge needed, due to function move.
Main excitement here is Peter Zijlstra's lockless rbtree optimization to
 speed module address lookup.  He found some abusers of the module lock
 doing that too.
 
 A little bit of parameter work here too; including Dan Streetman's breaking
 up the big param mutex so writing a parameter can load another module (yeah,
 really).  Unfortunately that broke the usual suspects, !CONFIG_MODULES and
 !CONFIG_SYSFS, so those fixes were appended too.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVkgKHAAoJENkgDmzRrbjxQpwQAJVmBN6jF3SnwbQXv9vRixjH
 58V33sb1G1RW+kXxQ3/e8jLX/4VaN479CufruXQp+IJWXsN/CH0lbC3k8m7u50d7
 b1Zeqd/Yrh79rkc11b0X1698uGCSMlzz+V54Z0QOTEEX+nSu2ZZvccFS4UaHkn3z
 rqDo00lb7rxQz8U25qro2OZrG6D3ub2q20TkWUB8EO4AOHkPn8KWP2r429Axrr0K
 wlDWDTTt8/IsvPbuPf3T15RAhq1avkMXWn9nDXDjyWbpLfTn8NFnWmtesgY7Jl4t
 GjbXC5WYekX3w2ZDB9KaT/DAMQ1a7RbMXNSz4RX4VbzDl+yYeSLmIh2G9fZb1PbB
 PsIxrOgy4BquOWsJPm+zeFPSC3q9Cfu219L4AmxSjiZxC3dlosg5rIB892Mjoyv4
 qxmg6oiqtc4Jxv+Gl9lRFVOqyHZrTC5IJ+xgfv1EyP6kKMUKLlDZtxZAuQxpUyxR
 HZLq220RYnYSvkWauikq4M8fqFM8bdt6hLJnv7bVqllseROk9stCvjSiE3A9szH5
 OgtOfYV5GhOeb8pCZqJKlGDw+RoJ21jtNCgOr6DgkNKV9CX/kL/Puwv8gnA0B0eh
 dxCeB7f/gcLl7Cg3Z3gVVcGlgak6JWrLf5ITAJhBZ8Lv+AtL2DKmwEWS/iIMRmek
 tLdh/a9GiCitqS0bT7GE
 =tWPQ
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Main excitement here is Peter Zijlstra's lockless rbtree optimization
  to speed module address lookup.  He found some abusers of the module
  lock doing that too.

  A little bit of parameter work here too; including Dan Streetman's
  breaking up the big param mutex so writing a parameter can load
  another module (yeah, really).  Unfortunately that broke the usual
  suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
  appended too"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
  modules: only use mod->param_lock if CONFIG_MODULES
  param: fix module param locks when !CONFIG_SYSFS.
  rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
  module: add per-module param_lock
  module: make perm const
  params: suppress unused variable error, warn once just in case code changes.
  modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
  kernel/module.c: avoid ifdefs for sig_enforce declaration
  kernel/workqueue.c: remove ifdefs over wq_power_efficient
  kernel/params.c: export param_ops_bool_enable_only
  kernel/params.c: generalize bool_enable_only
  kernel/module.c: use generic module param operaters for sig_enforce
  kernel/params: constify struct kernel_param_ops uses
  sysfs: tightened sysfs permission checks
  module: Rework module_addr_{min,max}
  module: Use __module_address() for module_address_lookup()
  module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  module: Optimize __module_address() using a latched RB-tree
  rbtree: Implement generic latch_tree
  seqlock: Introduce raw_read_seqcount_latch()
  ...
2015-07-01 10:49:25 -07:00
Ingo Molnar fb39f98d14 printk: Increase maximum CONFIG_LOG_BUF_SHIFT from 21 to 25
So I tried to some kernel debugging that produced a ton of kernel messages
on a big box, and wanted to save them all: but CONFIG_LOG_BUF_SHIFT maxes
out at 21 (2 MB).

Increase it to 25 (32 MB).

This does not affect any existing config or defaults.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-01 10:19:11 +02:00
Mel Gorman 0e1cc95b4c mm: meminit: finish initialisation of struct pages before basic setup
Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred.  One approach is to initialise
memory on demand but it interferes with page allocator paths.  This patch
creates dedicated threads to initialise memory before basic setup.  It
then blocks on a rw_semaphore until completion as a wait_queue and counter
is overkill.  This may be slower to boot but it's simplier overall and
also gets rid of a section mangling which existed so kswapd could do the
initialisation.

[akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Waiman Long <waiman.long@hp.com
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Scott Norton <scott.norton@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:56 -07:00
Linus Torvalds bbe179f88d Merge branch 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - threadgroup_lock got reorganized so that its users can pick the
   actual locking mechanism to use.  Its only user - cgroups - is
   updated to use a percpu_rwsem instead of per-process rwsem.

   This makes things a bit lighter on hot paths and allows cgroups to
   perform and fail multi-task (a process) migrations atomically.
   Multi-task migrations are used in several places including the
   unified hierarchy.

 - Delegation rule and documentation added to unified hierarchy.  This
   will likely be the last interface update from the cgroup core side
   for unified hierarchy before lifting the devel mask.

 - Some groundwork for the pids controller which is scheduled to be
   merged in the coming devel cycle.

* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: add delegation section to unified hierarchy documentation
  cgroup: require write perm on common ancestor when moving processes on the default hierarchy
  cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
  kernfs: make kernfs_get_inode() public
  MAINTAINERS: add a cgroup core co-maintainer
  cgroup: fix uninitialised iterator in for_each_subsys_which
  cgroup: replace explicit ss_mask checking with for_each_subsys_which
  cgroup: use bitmask to filter for_each_subsys
  cgroup: add seq_file forward declaration for struct cftype
  cgroup: simplify threadgroup locking
  sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
  sched, cgroup: reorganize threadgroup locking
  cgroup: switch to unsigned long for bitmasks
  cgroup: reorganize include/linux/cgroup.h
  cgroup: separate out include/linux/cgroup-defs.h
  cgroup: fix some comment typos
2015-06-26 19:50:04 -07:00
Linus Torvalds 8d7804a2f0 Driver core patches for 4.2-rc1
Here is the driver core / firmware changes for 4.2-rc1.
 
 A number of small changes all over the place in the driver core, and in
 the firmware subsystem.  Nothing really major, full details in the
 shortlog.  Some of it is a bit of churn, given that the platform driver
 probing changes was found to not work well, so they were reverted.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlWNoCQACgkQMUfUDdst+ym4JACdFrrXoMt2pb8nl5gMidGyM9/D
 jg8AnRgdW8ArDA/xOarULd/X43eA3J3C
 =Al2B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the driver core / firmware changes for 4.2-rc1.

  A number of small changes all over the place in the driver core, and
  in the firmware subsystem.  Nothing really major, full details in the
  shortlog.  Some of it is a bit of churn, given that the platform
  driver probing changes was found to not work well, so they were
  reverted.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
  Revert "base/platform: Only insert MEM and IO resources"
  Revert "base/platform: Continue on insert_resource() error"
  Revert "of/platform: Use platform_device interface"
  Revert "base/platform: Remove code duplication"
  firmware: add missing kfree for work on async call
  fs: sysfs: don't pass count == 0 to bin file readers
  base:dd - Fix for typo in comment to function driver_deferred_probe_trigger().
  base/platform: Remove code duplication
  of/platform: Use platform_device interface
  base/platform: Continue on insert_resource() error
  base/platform: Only insert MEM and IO resources
  firmware: use const for remaining firmware names
  firmware: fix possible use after free on name on asynchronous request
  firmware: check for file truncation on direct firmware loading
  firmware: fix __getname() missing failure check
  drivers: of/base: move of_init to driver_init
  drivers/base: cacheinfo: fix annoying typo when DT nodes are absent
  sysfs: disambiguate between "error code" and "failure" in comments
  driver-core: fix build for !CONFIG_MODULES
  driver-core: make __device_attach() static
  ...
2015-06-26 15:07:37 -07:00
Linus Torvalds 47a469421d Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - most of the rest of MM

 - lots of misc things

 - procfs updates

 - printk feature work

 - updates to get_maintainer, MAINTAINERS, checkpatch

 - lib/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  exit,stats: /* obey this comment */
  coredump: add __printf attribute to cn_*printf functions
  coredump: use from_kuid/kgid when formatting corename
  fs/reiserfs: remove unneeded cast
  NILFS2: support NFSv2 export
  fs/befs/btree.c: remove unneeded initializations
  fs/minix: remove unneeded cast
  init/do_mounts.c: add create_dev() failure log
  kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
  fs/efs: femove unneeded cast
  checkpatch: emit "NOTE: <types>" message only once after multiple files
  checkpatch: emit an error when there's a diff in a changelog
  checkpatch: validate MODULE_LICENSE content
  checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
  checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
  checkpatch: fix processing of MEMSET issues
  checkpatch: suggest using ether_addr_equal*()
  checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
  checkpatch: remove local from codespell path
  checkpatch: add --showfile to allow input via pipe to show filenames
  ...
2015-06-26 09:52:05 -07:00
Vishnu Pratap Singh c69e3c3a0c init/do_mounts.c: add create_dev() failure log
If create_dev() function fails to create the root mount device
(/dev/root), then it goes to panic as root device not found but there is
no printk in this case.  So I have added the log in case it fails to
create the root device.  It will help in debugging.

[akpm@linux-foundation.org: simplify printk(), use pr_emerg(), display errno]
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Ehrenberg <dehrenberg@chromium.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:42 -07:00
Iago López Galeiras 2e13ba54a2 fs, proc: introduce CONFIG_PROC_CHILDREN
Commit 818411616b ("fs, proc: introduce /proc/<pid>/task/<tid>/children
entry") introduced the children entry for checkpoint restore and the
file is only available on kernels configured with CONFIG_EXPERT and
CONFIG_CHECKPOINT_RESTORE.

This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

However, the children proc file is useful outside of checkpoint restore.
I would like to use it in rkt.  The rkt process exec() another program
it does not control, and that other program will fork()+exec() a child
process.  I would like to find the pid of the child process from an
external tool without iterating in /proc over all processes to find
which one has a parent pid equal to rkt.

This commit introduces CONFIG_PROC_CHILDREN and makes
CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
/proc/<pid>/task/<tid>/children without needing to enable
CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

Alban tested that /proc/<pid>/task/<tid>/children is present when the
kernel is configured with CONFIG_PROC_CHILDREN=y but without
CONFIG_CHECKPOINT_RESTORE

Signed-off-by: Iago López Galeiras <iago@endocode.com>
Tested-by: Alban Crequy <alban@endocode.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Djalal Harouni <djalal@endocode.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:37 -07:00
Linus Torvalds e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Linus Torvalds 43c9fad942 Power management and ACPI material for v4.2-rc1
- ACPICA update to upstream revision 20150515 including basic
    support for ACPI 6 features: new ACPI tables introduced by
    ACPI 6 (STAO, XENV, WPBT, NFIT, IORT), changes related to the
    other tables (DTRM, FADT, LPIT, MADT), new predefined names
    (_BTH, _CR3, _DSD, _LPI, _MTL, _PRR, _RDI, _RST, _TFP, _TSN),
    fixes and cleanups (Bob Moore, Lv Zheng).
 
  - ACPI device power management core code update to follow ACPI 6
    which reflects the ACPI device power management implementation
    in Windows (Rafael J Wysocki).
 
  - Rework of the backlight interface selection logic to reduce the
    number of kernel command line options and improve the handling
    of DMI quirks that may be involved in that and to make the
    code generally more straightforward (Hans de Goede).
 
  - Fixes for the ACPI Embedded Controller (EC) driver related to
    the handling of EC transactions (Lv Zheng).
 
  - Fix for a regression related to the ACPI resources management
    and resulting from a recent change of ACPI initialization code
    ordering (Rafael J Wysocki).
 
  - Fix for a system initialization regression related to ACPI
    introduced during the 3.14 cycle and caused by running the
    code that switches the platform over to the ACPI mode too
    early in the initialization sequence (Rafael J Wysocki).
 
  - Support for the ACPI _CCA device configuration object related
    to DMA cache coherence (Suravee Suthikulpanit).
 
  - ACPI/APEI fixes and cleanups (Jiri Kosina, Borislav Petkov).
 
  - ACPI battery driver cleanups (Luis Henriques, Mathias Krause).
 
  - ACPI processor driver cleanups (Hanjun Guo).
 
  - Cleanups and documentation update related to the ACPI device
    properties interface based on _DSD (Rafael J Wysocki).
 
  - ACPI device power management fixes (Rafael J Wysocki).
 
  - Assorted cleanups related to ACPI (Dominik Brodowski. Fabian
    Frederick, Lorenzo Pieralisi, Mathias Krause, Rafael J Wysocki).
 
  - Fix for a long-standing issue causing General Protection Faults
    to be generated occasionally on return to user space after resume
    from ACPI-based suspend-to-RAM on 32-bit x86 (Ingo Molnar).
 
  - Fix to make the suspend core code return -EBUSY consistently in
    all cases when system suspend is aborted due to wakeup detection
    (Ruchi Kandoi).
 
  - Support for automated device wakeup IRQ handling allowing drivers
    to make their PM support more starightforward (Tony Lindgren).
 
  - New tracepoints for suspend-to-idle tracing and rework of the
    prepare/complete callbacks tracing in the PM core (Todd E Brandt,
    Rafael J Wysocki).
 
  - Wakeup sources framework enhancements (Jin Qian).
 
  - New macro for noirq system PM callbacks (Grygorii Strashko).
 
  - Assorted cleanups related to system suspend (Rafael J Wysocki).
 
  - cpuidle core cleanups to make the code more efficient (Rafael J
    Wysocki).
 
  - powernv/pseries cpuidle driver update (Shilpasri G Bhat).
 
  - cpufreq core fixes related to CPU online/offline that should
    reduce the overhead of these operations quite a bit, unless the
    CPU in question is physically going away (Viresh Kumar, Saravana
    Kannan).
 
  - Serialization of cpufreq governor callbacks to avoid race
    conditions in some cases (Viresh Kumar).
 
  - intel_pstate driver fixes and cleanups (Doug Smythies, Prarit
    Bhargava, Joe Konno).
 
  - cpufreq driver (arm_big_little, cpufreq-dt, qoriq) updates (Sudeep
    Holla, Felipe Balbi, Tang Yuantian).
 
  - Assorted cleanups in cpufreq drivers and core (Shailendra Verma,
    Fabian Frederick, Wang Long).
 
  - New Device Tree bindings for representing Operating Performance
    Points (Viresh Kumar).
 
  - Updates for the common clock operations support code in the PM
    core (Rajendra Nayak, Geert Uytterhoeven).
 
  - PM domains core code update (Geert Uytterhoeven).
 
  - Intel Knights Landing support for the RAPL (Running Average Power
    Limit) power capping driver (Dasaratharaman Chandramouli).
 
  - Fixes related to the floor frequency setting on Atom SoCs in the
    RAPL power capping driver (Ajay Thomas).
 
  - Runtime PM framework documentation update (Ben Dooks).
 
  - cpupower tool fix (Herton R Krzesinski).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJViJdWAAoJEILEb/54YlRx/9gP/3gHoFevNRycvn0VpKqdufCI
 Mxy2LBBLlfyW2uD3+NvqvA2WWSo0Cs/LgXa04eAVxPdU7k48s8w+54U23wSouzjW
 gfwAmuHxzDR8v0h8X3h6BxNzmkIQHtmDcQlA/cZdHejY/UUw01yxRGNUUZDNbxlm
 WXn2nmlBLmGqXTYq0fpBV+3jicUghJqHHsBCqa3VR2yQioHMJG01F4UZMqYTZunN
 OIvDUghxByKz6alzdCqlLl1Y0exV6vwWUAzBsl1qHqmHu/bWFSZn3ujNNVrjqHhw
 Kl7/8dC2pQkv3Zo3gEVvfQ0onotwWZxGHzPQRdvmxvRnBunQVCi/wynx90yABX/r
 PPb/iBNV0mZskbF0zb0GZT3ZZWGA8Z0p3o5JQv2jV4m62qTzx8w50Y5kbn9N1WT+
 5bre7AVbVAlGonWszcS9iE+6TOboRz9OD1CCwPFXHItFutlBkau+1hHfFoLM0o9n
 LhpGuyszT/EUa1BHkLzuCckFqO2DpbF3N2CKmuTekw0CdgdsvRL2pRByuerk3j7R
 WQhlcvBq5YH6j43AuoEZKp8r1iN8oG/iqlrMYQaYWrW9hJaoQOoU8dGJxp/e7gKN
 r/qeYjETI+tIsjCbtH5WQzzxDI3gPISAYAtfqs7G34EEo+Lwp6kyRUAF4kDot2V3
 ZIyuKMmTu4cdwDETr/O+
 =7jTj
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael Wysocki:
 "The rework of backlight interface selection API from Hans de Goede
  stands out from the number of commits and the number of affected
  places perspective.  The cpufreq core fixes from Viresh Kumar are
  quite significant too as far as the number of commits goes and because
  they should reduce CPU online/offline overhead quite a bit in the
  majority of cases.

  From the new featues point of view, the ACPICA update (to upstream
  revision 20150515) adding support for new ACPI 6 material to ACPICA is
  the one that matters the most as some new significant features will be
  based on it going forward.  Also included is an update of the ACPI
  device power management core to follow ACPI 6 (which in turn reflects
  the Windows' device PM implementation), a PM core extension to support
  wakeup interrupts in a more generic way and support for the ACPI _CCA
  device configuration object.

  The rest is mostly fixes and cleanups all over and some documentation
  updates, including new DT bindings for Operating Performance Points.

  There is one fix for a regression introduced in the 4.1 cycle, but it
  adds quite a number of lines of code, it wasn't really ready before
  Thursday and you were on vacation, so I refrained from pushing it on
  the last minute for 4.1.

  Specifics:

   - ACPICA update to upstream revision 20150515 including basic support
     for ACPI 6 features: new ACPI tables introduced by ACPI 6 (STAO,
     XENV, WPBT, NFIT, IORT), changes related to the other tables (DTRM,
     FADT, LPIT, MADT), new predefined names (_BTH, _CR3, _DSD, _LPI,
     _MTL, _PRR, _RDI, _RST, _TFP, _TSN), fixes and cleanups (Bob Moore,
     Lv Zheng).

   - ACPI device power management core code update to follow ACPI 6
     which reflects the ACPI device power management implementation in
     Windows (Rafael J Wysocki).

   - rework of the backlight interface selection logic to reduce the
     number of kernel command line options and improve the handling of
     DMI quirks that may be involved in that and to make the code
     generally more straightforward (Hans de Goede).

   - fixes for the ACPI Embedded Controller (EC) driver related to the
     handling of EC transactions (Lv Zheng).

   - fix for a regression related to the ACPI resources management and
     resulting from a recent change of ACPI initialization code ordering
     (Rafael J Wysocki).

   - fix for a system initialization regression related to ACPI
     introduced during the 3.14 cycle and caused by running the code
     that switches the platform over to the ACPI mode too early in the
     initialization sequence (Rafael J Wysocki).

   - support for the ACPI _CCA device configuration object related to
     DMA cache coherence (Suravee Suthikulpanit).

   - ACPI/APEI fixes and cleanups (Jiri Kosina, Borislav Petkov).

   - ACPI battery driver cleanups (Luis Henriques, Mathias Krause).

   - ACPI processor driver cleanups (Hanjun Guo).

   - cleanups and documentation update related to the ACPI device
     properties interface based on _DSD (Rafael J Wysocki).

   - ACPI device power management fixes (Rafael J Wysocki).

   - assorted cleanups related to ACPI (Dominik Brodowski, Fabian
     Frederick, Lorenzo Pieralisi, Mathias Krause, Rafael J Wysocki).

   - fix for a long-standing issue causing General Protection Faults to
     be generated occasionally on return to user space after resume from
     ACPI-based suspend-to-RAM on 32-bit x86 (Ingo Molnar).

   - fix to make the suspend core code return -EBUSY consistently in all
     cases when system suspend is aborted due to wakeup detection (Ruchi
     Kandoi).

   - support for automated device wakeup IRQ handling allowing drivers
     to make their PM support more starightforward (Tony Lindgren).

   - new tracepoints for suspend-to-idle tracing and rework of the
     prepare/complete callbacks tracing in the PM core (Todd E Brandt,
     Rafael J Wysocki).

   - wakeup sources framework enhancements (Jin Qian).

   - new macro for noirq system PM callbacks (Grygorii Strashko).

   - assorted cleanups related to system suspend (Rafael J Wysocki).

   - cpuidle core cleanups to make the code more efficient (Rafael J
     Wysocki).

   - powernv/pseries cpuidle driver update (Shilpasri G Bhat).

   - cpufreq core fixes related to CPU online/offline that should reduce
     the overhead of these operations quite a bit, unless the CPU in
     question is physically going away (Viresh Kumar, Saravana Kannan).

   - serialization of cpufreq governor callbacks to avoid race
     conditions in some cases (Viresh Kumar).

   - intel_pstate driver fixes and cleanups (Doug Smythies, Prarit
     Bhargava, Joe Konno).

   - cpufreq driver (arm_big_little, cpufreq-dt, qoriq) updates (Sudeep
     Holla, Felipe Balbi, Tang Yuantian).

   - assorted cleanups in cpufreq drivers and core (Shailendra Verma,
     Fabian Frederick, Wang Long).

   - new Device Tree bindings for representing Operating Performance
     Points (Viresh Kumar).

   - updates for the common clock operations support code in the PM core
     (Rajendra Nayak, Geert Uytterhoeven).

   - PM domains core code update (Geert Uytterhoeven).

   - Intel Knights Landing support for the RAPL (Running Average Power
     Limit) power capping driver (Dasaratharaman Chandramouli).

   - fixes related to the floor frequency setting on Atom SoCs in the
     RAPL power capping driver (Ajay Thomas).

   - runtime PM framework documentation update (Ben Dooks).

   - cpupower tool fix (Herton R Krzesinski)"

* tag 'pm+acpi-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (194 commits)
  cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
  x86: Load __USER_DS into DS/ES after resume
  PM / OPP: Add binding for 'opp-suspend'
  PM / OPP: Allow multiple OPP tables to be passed via DT
  PM / OPP: Add new bindings to address shortcomings of existing bindings
  ACPI: Constify ACPI device IDs in documentation
  ACPI / enumeration: Document the rules regarding the PRP0001 device ID
  ACPI / video: Make acpi_video_unregister_backlight() private
  acpi-video-detect: Remove old API
  toshiba-acpi: Port to new backlight interface selection API
  thinkpad-acpi: Port to new backlight interface selection API
  sony-laptop: Port to new backlight interface selection API
  samsung-laptop: Port to new backlight interface selection API
  msi-wmi: Port to new backlight interface selection API
  msi-laptop: Port to new backlight interface selection API
  intel-oaktrail: Port to new backlight interface selection API
  ideapad-laptop: Port to new backlight interface selection API
  fujitsu-laptop: Port to new backlight interface selection API
  eeepc-laptop: Port to new backlight interface selection API
  dell-wmi: Port to new backlight interface selection API
  ...
2015-06-23 14:18:07 -07:00
Rusty Russell b6c09b512d modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
Andreas turned this option on, only to find out Debian (and Ubuntu!)
don't enable support in their kmod builds.

Shorten the text, and suggest N at the bottom (at least for now).

Reported-by: Andreas Mohr <andim2@users.sf.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-06-23 15:27:36 +09:30
Linus Torvalds c58267e9fa Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes mostly consist of work on x86 PMU drivers:

   - x86 Intel PT (hardware CPU tracer) improvements (Alexander
     Shishkin)

   - x86 Intel CQM (cache quality monitoring) improvements (Thomas
     Gleixner)

   - x86 Intel PEBSv3 support (Peter Zijlstra)

   - x86 Intel PEBS interrupt batching support for lower overhead
     sampling (Zheng Yan, Kan Liang)

   - x86 PMU scheduler fixes and improvements (Peter Zijlstra)

  There's too many tooling improvements to list them all - here are a
  few select highlights:

  'perf bench':

      - Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
        measure parallel waker threads generating contention for kernel
        locks (hb->lock). (Davidlohr Bueso)

  'perf top', 'perf report':

      - Allow disabling/enabling events dynamicaly in 'perf top':
        a 'perf top' session can instantly become a 'perf report'
        one, i.e. going from dynamic analysis to a static one,
        returning to a dynamic one is possible, to toogle the
        modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)

      - Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
        perf.data files (Namhyung Kim)

  'perf probe': (Masami Hiramatsu)

      - Support glob wildcards for function name
      - Support $params special probe argument: Collect all function arguments
      - Make --line checks validate C-style function name.
      - Add --no-inlines option to avoid searching inline functions
      - Greatly speed up 'perf probe --list' by caching debuginfo.
      - Improve --filter support for 'perf probe', allowing using its arguments
        on other commands, as --add, --del, etc.

  'perf sched':

      - Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)

  Plus tons of infrastructure work - in particular preparation for
  upcoming threaded perf report support, but also lots of other work -
  and fixes and other improvements.  See (much) more details in the
  shortlog and in the git log"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
  perf tools: Configurable per thread proc map processing time out
  perf tools: Add time out to force stop proc map processing
  perf report: Fix sort__sym_cmp to also compare end of symbol
  perf hists browser: React to unassigned hotkey pressing
  perf top: Tell the user how to unfreeze events after pressing 'f'
  perf hists browser: Honour the help line provided by builtin-{top,report}.c
  perf hists browser: Do not exit when 'f' is pressed in 'report' mode
  perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
  perf annotate: Rename source_line_percent to source_line_samples
  perf annotate: Display total number of samples with --show-total-period
  perf tools: Ensure thread-stack is flushed
  perf top: Allow disabling/enabling events dynamicly
  perf evlist: Add toggle_enable() method
  perf trace: Fix race condition at the end of started workloads
  perf probe: Speed up perf probe --list by caching debuginfo
  perf probe: Show usage even if the last event is skipped
  perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
  perf tools: Fix a problem when opening old perf.data with different byte order
  perf tools: Ignore .config-detected in .gitignore
  perf probe: Fix to return error if no probe is added
  ...
2015-06-22 15:19:21 -07:00
Rafael J. Wysocki b064a8fa77 ACPI / init: Switch over platform to the ACPI mode later
Commit 73f7d1ca32 "ACPI / init: Run acpi_early_init() before
timekeeping_init()" moved the ACPI subsystem initialization,
including the ACPI mode enabling, to an earlier point in the
initialization sequence, to allow the timekeeping subsystem
use ACPI early.  Unfortunately, that resulted in boot regressions
on some systems and the early ACPI initialization was moved toward
its original position in the kernel initialization code by commit
c4e1acbb35 "ACPI / init: Invoke early ACPI initialization later".

However, that turns out to be insufficient, as boot is still broken
on the Tyan S8812 mainboard.

To fix that issue, split the ACPI early initialization code into
two pieces so the majority of it still located in acpi_early_init()
and the part switching over the platform into the ACPI mode goes into
a new function, acpi_subsystem_init(), executed at the original early
ACPI initialization spot.

That fixes the Tyan S8812 boot problem, but still allows ACPI
tables to be loaded earlier which is useful to the EFI code in
efi_enter_virtual_mode().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141
Fixes: 73f7d1ca32 "ACPI / init: Run acpi_early_init() before timekeeping_init()"
Reported-and-tested-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
2015-06-10 23:51:27 +02:00
Tejun Heo 89e9b9e07a writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK
cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Peter Zijlstra 6c9692e2d6 module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
Andrew worried about the overhead on small systems; only use the fancy
code when either perf or tracing is enabled.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-05-28 11:32:07 +09:30
Pranith Kumar e72aeafc66 rcu: Remove prompt for RCU implementation
The RCU implementation is chosen based on PREEMPT and SMP config options
and is not really a user-selectable choice.  This commit removes the
menu entry, given that there is not much point in calling something a
choice when there is in fact no choice..  The TINY_RCU, TREE_RCU, and
PREEMPT_RCU Kconfig options continue to be selected based solely on the
values of the PREEMPT and SMP options.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:59:06 -07:00
Paul E. McKenney 26730f55c2 rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO
This commit updates the initialization of the kthread_prio boot parameter
so that RCU will build even when CONFIG_RCU_KTHREAD_PRIO is undefined.
The kthread_prio boot parameter is set to CONFIG_RCU_KTHREAD_PRIO if
that is defined, otherwise to 1 if CONFIG_RCU_BOOST is defined and
to zero otherwise.  This commit then makes CONFIG_RCU_KTHREAD_PRIO
depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_KTHREAD_PRIO unless they want to be.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:06 -07:00
Paul E. McKenney 47d631af58 rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF
This commit introduces an RCU_FANOUT_LEAF C-preprocessor macro so
that RCU will build even when CONFIG_RCU_FANOUT_LEAF is undefined.
The RCU_FANOUT_LEAF macro is set to the value of CONFIG_RCU_FANOUT_LEAF
when defined, otherwise it is set to 32 for 32-bit systems and 64 for
64-bit systems.  This commit then makes CONFIG_RCU_FANOUT_LEAF depend
on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_FANOUT_LEAF unless they want to be.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:05 -07:00
Paul E. McKenney 05c5df31af rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT
This commit introduces an RCU_FANOUT C-preprocessor macro so that RCU will
build even when CONFIG_RCU_FANOUT is undefined.  The RCU_FANOUT macro is
set to the value of CONFIG_RCU_FANOUT when defined, otherwise it is set
to 32 for 32-bit systems and 64 for 64-bit systems.  This commit then
makes CONFIG_RCU_FANOUT depend on CONFIG_RCU_EXPERT, so that Kconfig
users won't be asked about CONFIG_RCU_FANOUT unless they want to be.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:05 -07:00
Paul E. McKenney 8739c5cb0f rcu: Break dependency of RCU_FANOUT_LEAF on RCU_FANOUT
RCU_FANOUT_LEAF's range and default values depend on the value of
RCU_FANOUT, which at the time seemed like a cute way to save two lines
of Kconfig code.  However, adding a dependency from both of these
Kconfig parameters on RCU_EXPERT requires that RCU_FANOUT_LEAF operate
correctly even if RCU_FANOUT is undefined.  This commit therefore
allows RCU_FANOUT_LEAF to take on the full range of permitted values,
even in cases where RCU_FANOUT is undefined.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Eliminate redundant "default" as suggested by Pranith Kumar. ]
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:05 -07:00
Paul E. McKenney 78cae10b3a rcu: Create RCU_EXPERT Kconfig and hide booleans behind it
This commit creates an RCU_EXPERT Kconfig and hides the independent
boolean RCU-related user-visible Kconfig parameters behind it, namely
RCU_FAST_NO_HZ and RCU_BOOST.  This prevents Kconfig from asking about
these parameters unless the user really wants to be asked.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:04 -07:00
Paul E. McKenney 7fa270010e rcu: Convert CONFIG_RCU_FANOUT_EXACT to boot parameter
The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and
perhaps only) by rcutorture to verify that RCU works correctly in specific
rcu_node combining-tree configurations.  It therefore does not make
much sense have this as a question to people attempting to configure
their kernels.  So this commit creates an rcutree.rcu_fanout_exact=
boot parameter that rcutorture can use, and eliminates the original
CONFIG_RCU_FANOUT_EXACT Kconfig parameter.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:04 -07:00
Paul E. McKenney 7db21edfec rcu: Directly drive RCU_USER_QS from Kconfig
Currently, Kconfig will ask the user whether RCU_USER_QS should be set.
This is silly because Kconfig already has all the information that it
needs to set this parameter.  This commit therefore directly drives
the value of RCU_USER_QS via NO_HZ_FULL's "select" statement.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
2015-05-27 12:59:03 -07:00
Paul E. McKenney 82d0f4c089 rcu: Directly drive TASKS_RCU from Kconfig
Currently, Kconfig will ask the user whether TASKS_RCU should be set.
This is silly because Kconfig already has all the information that it
needs to set this parameter.  This commit therefore directly drives
the value of TASKS_RCU via "select" statements.  Which means that
as subsystems require TASKS_RCU, those subsystems will need to add
"select" statements of their own.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:03 -07:00
Ingo Molnar 8d12ded3dd Merge branch 'perf/urgent' into perf/core, before applying dependent patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 09:17:21 +02:00
Tejun Heo d59cfc09c3 sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
The cgroup side of threadgroup locking uses signal_struct->group_rwsem
to synchronize against threadgroup changes.  This per-process rwsem
adds small overhead to thread creation, exit and exec paths, forces
cgroup code paths to do lock-verify-unlock-retry dance in a couple
places and makes it impossible to atomically perform operations across
multiple processes.

This patch replaces signal_struct->group_rwsem with a global
percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
side and contained in cgroups proper.  This patch converts one-to-one.

This does make writer side heavier and lower the granularity; however,
cgroup process migration is a fairly cold path, we do want to optimize
thread operations over it and cgroup migration operations don't take
enough time for the lower granularity to matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-05-26 20:35:00 -04:00
Luis R. Rodriguez ecc8617053 module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.

@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
 extern char *parse_args(const char *name,
 			 char *args,
 			 const struct kernel_param *params,
 			 unsigned num,
 			 s16 level_min,
 			 s16 level_max,
+			 void *arg,
 			 int (*unknown)(char *param, char *val,
					const char *doing
+					, void *arg
					));

@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
 char *parse_args(const char *name,
 			 char *args,
 			 const struct kernel_param *params,
 			 unsigned num,
 			 s16 level_min,
 			 s16 level_max,
+			 void *arg,
 			 int (*unknown)(char *param, char *val,
					const char *doing
+					, void *arg
					))
{
	...
}

@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@

(
	R =
	parse_args(E1, E2, E3, E4, E5, E6,
+		   NULL,
		   func);
|
	R =
	parse_args(E1, E2, E3, E4, E5, E6,
+		   NULL,
		   &func);
|
	R =
	parse_args(E1, E2, E3, E4, E5, E6,
+		   NULL,
		   NULL);
|
	parse_args(E1, E2, E3, E4, E5, E6,
+		   NULL,
		   func);
|
	parse_args(E1, E2, E3, E4, E5, E6,
+		   NULL,
		   &func);
|
	parse_args(E1, E2, E3, E4, E5, E6,
+		   NULL,
		   NULL);
)

@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@

int func(char *param, char *val, const char *unused
+		 , void *arg
		 )
{
	...
}

@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@

-	func(A1, A2, A3);
+	func(A1, A2, A3, NULL);

Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-20 00:25:24 -07:00
Michael Ellerman cb30711374 perf_event: Don't allow vmalloc() backed perf on powerpc
On powerpc the perf event interrupt is not masked when interrupts are
disabled, allowing it to function as an NMI.

This causes problems if perf is using vmalloc. If we take a page fault
on the vmalloc region the fault handler will fail the page fault because
it detects we are coming in from an NMI (see do_hash_page()).

We don't actually need or want vmalloc backed perf so just disable it on
powerpc.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linuxppc-dev@ozlabs.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@ghostprotocols.net
Cc: sukadev@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1430720799-18426-1-git-send-email-mpe@ellerman.id.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:26:01 +02:00
Chen Yu cb31ef485d init: fix regression by supporting devices with major:minor:offset format
Commit 283e7ad02 ("init: stricter checking of major:minor root=
values") was so strict that it exposed the fact that a previously
unknown device format was being used.

Distributions like Ubuntu uses klibc (rather than uswsusp) to resume
system from hibernation.  klibc expressed the swap partition/file in
the form of major:minor:offset.  For example, 8:3:0 represents a swap
partition in klibc, and klibc's resume process in initrd will finally
echo 8:3:0 to /sys/power/resume for manually resuming.  However, due
to commit 283e7ad02's stricter checking, 8:3:0 will be treated as an
invalid device format, and manual resuming from hibernation will fail.

Fix this by adding support for devices with major:minor:offset format
when resuming from hibernation.

Reported-by: Prigent, Christophe <christophe.prigent@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-05 12:31:37 -04:00
Linus Torvalds d6a24d0640 The documentation tree update for 4.1. Numerous fixes, the overdue removal
of the i2o docs, some new Chinese translations, and, hopefully, the README
 fix that will end the flow of identical patches to that file.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVMOeZAAoJEI3ONVYwIuV6/1YQAJwcVvd3ow6cMKuf8eRKMgjd
 crJUdRF9FTwyY21SRHBaonyiKOthnVedOUYnFQ5Z7jbII0EohJ//72nW0pQrHoGi
 0avkbM+ZAZzXfd/paOiZ5HtYkc8Bdc70mPU1fzfexnPm/JACOGznxQsob05r6/sT
 W1GyJcrLlp4uPrba9rhAdGtaa+mEFrt4SVCI+odXOnxQ/KOZSIu1n1F3bSSvL9zV
 YMBgRrCso+Cdtuhe4N3O1jsVy/hyOnvtqcUgwlD4VzElsshKvxdxHn47yeeWK1qI
 zZfGTv+q5QI/eZgIhBIrBpdCafgLipAmNZhX+M76xeNydhfhp60VizRgKfb6JiW1
 M+nLSnv2vxh4wgkJs7yWgW+8TJ0eCF/w2P/mBi6hDXuIul9gwgmNk+EZ7LONSh+2
 l3PV/dyswNs04qbYgFt2rygsxcg79RRbD54zi+S/3NU38/gh7nlidASKmNu1boG/
 KdZx3F0rmX/xQu4aQ5nIQl2N7sVLkEec+oN+ukQGyBTLVHkfAK06Z4EWUeYmmBKh
 M6gqRY5XJMtCm8D5bons/yZmwmpdZFZMxFGJ4enUwrfsJ8FQ8qy/KmFqF8SojGWQ
 HYs4ZUptz6SYa7K0Txe/q0pkrW+doy7t/Bz+JBNNdG7eLeHIpKhSqSlLIdB7MRKw
 NFA8a4PdgdMpf+Zr4bRy
 =LDbk
 -----END PGP SIGNATURE-----

Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6

Pull documentation updates from Jonathan Corbet:
 "Numerous fixes, the overdue removal of the i2o docs, some new Chinese
  translations, and, hopefully, the README fix that will end the flow of
  identical patches to that file"

* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (34 commits)
  Documentation/memcg: update memcg/kmem status
  Documentation: blackfin: Makefile: Typo building issue
  Documentation/vm/pagemap.txt: correct location of page-types tool
  Documentation/memory-barriers.txt: typo fix
  doc: Add guest_nice column to example output of `cat /proc/stat'
  Documentation/kernel-parameters: Move "eagerfpu" to its right place
  Documentation: gpio: Update ACPI part of the document to mention _DSD
  docs/completion.txt: Various tweaks and corrections
  doc: completion: context, scope and language fixes
  Documentation:Update Documentation/zh_CN/arm64/memory.txt
  Documentation:Update Documentation/zh_CN/arm64/booting.txt
  Documentation: Chinese translation of arm64/legacy_instructions.txt
  DocBook media: fix broken EIA hyperlink
  Documentation: tweak the maintainers entry
  README: Change gzip/bzip2 to xz compression format
  README: Update version number reference
  doc:pci: Fix typo in Documentation/PCI
  Documentation: drm: Use '->' when describing access through pointers.
  Documentation: Remove mentioning of block barriers
  Documentation/email-clients.txt: Fix one grammar mistake, add extra info about TB
  ...
2015-04-18 11:10:49 -04:00
Linus Torvalds afad97eee4 - Most extensive changes this cycle are the DM core improvements to add
full blk-mq support to request-based DM.
   - disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
   - depends on some blk-mq changes from Jens' for-4.1/core branch so
     that explains why this pull is built on linux-block.git
 
 - Update DM to use name_to_dev_t() rather than open-coding a less
   capable device parser.
   - includes a couple small improvements to name_to_dev_t() that offer
     stricter constraints that DM's code provided.
 
 - Improvements to the dm-cache "mq" cache replacement policy.
 
 - A DM crypt crypt_ctr() error path fix and an async crypto deadlock fix.
 
 - A small efficiency improvement for DM crypt decryption by leveraging
   immutable biovecs.
 
 - Add error handling modes for corrupted blocks to DM verity.
 
 - A new "log-writes" DM target from Josef Bacik that is meant for
   file system developers to test file system integrity at particular
   points in the life of a file system.
 
 - A few DM log userspace cleanups and fixes.
 
 - A few Documentation fixes (for thin, cache, crypt and switch).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVMGvhAAoJEMUj8QotnQNaOKgH/iuWyEZ+ndWbeEBZs9ZJPEgU
 sRiKjSaPujkdB+78o+2DfAYjwCchqH1Oy781PqG85asTENVkcjD7LBPaIom/A2ml
 Z9ZeCzAKbrtB3lZR8QAg4u7+90kkuZv1SeOPMWBZCEV4GYLEpV1J4/f+RgdEs1kT
 upQcH1fFoQdyHGikwAKXf85Gmi4OrjVb19yoxu2xZnfwT2sCPq0okU4yBxb1B9mL
 X/FcHUeLS9LJGewEqTD3AvWrk+J5pqj+1EeKY2kRxisp1065TaljYwetgR76Vnfb
 o0J5eovJVL3AKRTYxvr5JABWD09xq9nG1hVBiuQMN5VmmvxjDKdbBPaKI4dZTt0=
 =uVjp
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - the most extensive changes this cycle are the DM core improvements to
   add full blk-mq support to request-based DM.

    - disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
    - depends on some blk-mq changes from Jens' for-4.1/core branch so
      that explains why this pull is built on linux-block.git

 - update DM to use name_to_dev_t() rather than open-coding a less
   capable device parser.

    - includes a couple small improvements to name_to_dev_t() that offer
      stricter constraints that DM's code provided.

 - improvements to the dm-cache "mq" cache replacement policy.

 - a DM crypt crypt_ctr() error path fix and an async crypto deadlock
   fix

 - a small efficiency improvement for DM crypt decryption by leveraging
   immutable biovecs

 - add error handling modes for corrupted blocks to DM verity

 - a new "log-writes" DM target from Josef Bacik that is meant for file
   system developers to test file system integrity at particular points
   in the life of a file system

 - a few DM log userspace cleanups and fixes

 - a few Documentation fixes (for thin, cache, crypt and switch)

* tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (34 commits)
  dm crypt: fix missing error code return from crypt_ctr error path
  dm crypt: fix deadlock when async crypto algorithm returns -EBUSY
  dm crypt: leverage immutable biovecs when decrypting on read
  dm crypt: update URLs to new cryptsetup project page
  dm: add log writes target
  dm table: use bool function return values of true/false not 1/0
  dm verity: add error handling modes for corrupted blocks
  dm thin: remove stale 'trim' message documentation
  dm delay: use msecs_to_jiffies for time conversion
  dm log userspace base: fix compile warning
  dm log userspace transfer: match wait_for_completion_timeout return type
  dm table: fall back to getting device using name_to_dev_t()
  init: stricter checking of major:minor root= values
  init: export name_to_dev_t and mark name argument as const
  dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr
  dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
  dm: add full blk-mq support to request-based DM
  dm: impose configurable deadline for dm_request_fn's merge heuristic
  dm sysfs: introduce ability to add writable attributes
  dm: don't start current request if it would've merged with the previous
  ...
2015-04-18 08:14:18 -04:00
Heinrich Schuchardt ff691f6e03 kernel/fork.c: new function for max_threads
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
THREAD_SIZE.

E.g.  architecture hexagon may have page size 1M and thread size 4096.
This would lead to a division by zero in the calculation of max_threads.

With this patch the buggy code is moved to a separate function
set_max_threads.  The error is not fixed.

After fixing the problem in a separate patch the new function can be
reused to adjust max_threads after adding or removing memory.

Argument mempages of function fork_init() is removed as totalram_pages is
an exported symbol.

The creation of separate patches for refactoring to a new function and for
fixing the logic was suggested by Ingo Molnar.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Iulia Manda 2813893f8b kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root.  For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional.  It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build.  The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB.  (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu.  All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Dan Ehrenberg 283e7ad024 init: stricter checking of major:minor root= values
In the kernel command-line, previously, root=1:2jakshflaksjdhfa would
be accepted and interpreted just like root=1:2. This patch adds
stricter checking so that additional characters after major:minor are
rejected by root=.

The goal of this change is to help in unifying DM's interpretation of
its block device argument by using existing kernel code (name_to_dev_t).
But DM rejects malformed major:minor pairs, it seems reasonable for
root= to reject them as well.

Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:19 -04:00
Dan Ehrenberg e6e20a7a5f init: export name_to_dev_t and mark name argument as const
DM will switch its device lookup code to using name_to_dev_t() so it
must be exported.  Also, the @name argument should be marked const.

Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:18 -04:00
Linus Torvalds 1dcf58d6e6 Merge branch 'akpm' (patches from Andrew)
Merge first patchbomb from Andrew Morton:

 - arch/sh updates

 - ocfs2 updates

 - kernel/watchdog feature

 - about half of mm/

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
  Documentation: update arch list in the 'memtest' entry
  Kconfig: memtest: update number of test patterns up to 17
  arm: add support for memtest
  arm64: add support for memtest
  memtest: use phys_addr_t for physical addresses
  mm: move memtest under mm
  mm, hugetlb: abort __get_user_pages if current has been oom killed
  mm, mempool: do not allow atomic resizing
  memcg: print cgroup information when system panics due to panic_on_oom
  mm: numa: remove migrate_ratelimited
  mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
  mm: split ET_DYN ASLR from mmap ASLR
  s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
  mm: expose arch_mmap_rnd when available
  s390: standardize mmap_rnd() usage
  powerpc: standardize mmap_rnd() usage
  mips: extract logic for mmap_rnd()
  arm64: standardize mmap_rnd() usage
  x86: standardize mmap_rnd() usage
  arm: factor out mmap ASLR into mmap_rnd
  ...
2015-04-14 16:49:17 -07:00
Toshi Kani 0ddab1d2ed lib/ioremap.c: add huge I/O map capability interfaces
Add ioremap_pud_enabled() and ioremap_pmd_enabled(), which return 1 when
I/O mappings with pud/pmd are enabled on the kernel.

ioremap_huge_init() calls arch_ioremap_pud_supported() and
arch_ioremap_pmd_supported() to initialize the capabilities at boot-time.

A new kernel option "nohugeiomap" is also added, so that user can disable
the huge I/O map capabilities when necessary.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:04 -07:00
Linus Torvalds 6c8a53c9e6 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "Core kernel changes:

   - One of the more interesting features in this cycle is the ability
     to attach eBPF programs (user-defined, sandboxed bytecode executed
     by the kernel) to kprobes.

     This allows user-defined instrumentation on a live kernel image
     that can never crash, hang or interfere with the kernel negatively.
     (Right now it's limited to root-only, but in the future we might
     allow unprivileged use as well.)

     (Alexei Starovoitov)

   - Another non-trivial feature is per event clockid support: this
     allows, amongst other things, the selection of different clock
     sources for event timestamps traced via perf.

     This feature is sought by people who'd like to merge perf generated
     events with external events that were measured with different
     clocks:

       - cluster wide profiling

       - for system wide tracing with user-space events,

       - JIT profiling events

     etc.  Matching perf tooling support is added as well, available via
     the -k, --clockid <clockid> parameter to perf record et al.

     (Peter Zijlstra)

  Hardware enablement kernel changes:

   - x86 Intel Processor Trace (PT) support: which is a hardware tracer
     on steroids, available on Broadwell CPUs.

     The hardware trace stream is directly output into the user-space
     ring-buffer, using the 'AUX' data format extension that was added
     to the perf core to support hardware constraints such as the
     necessity to have the tracing buffer physically contiguous.

     This patch-set was developed for two years and this is the result.
     A simple way to make use of this is to use BTS tracing, the PT
     driver emulates BTS output - available via the 'intel_bts' PMU.
     More explicit PT specific tooling support is in the works as well -
     will probably be ready by 4.2.

     (Alexander Shishkin, Peter Zijlstra)

   - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
     feature of Intel Xeon CPUs that allows the measurement and
     allocation/partitioning of caches to individual workloads.

     These kernel changes expose the measurement side as a new PMU
     driver, which exposes various QoS related PMU events.  (The
     partitioning change is work in progress and is planned to be merged
     as a cgroup extension.)

     (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
     Waskiewicz Jr)

   - x86 Intel Haswell LBR call stack support: this is a new Haswell
     feature that allows the hardware recording of call chains, plus
     tooling support.  To activate this feature you have to enable it
     via the new 'lbr' call-graph recording option:

        perf record --call-graph lbr
        perf report

     or:

        perf top --call-graph lbr

     This hardware feature is a lot faster than stack walk or dwarf
     based unwinding, but has some limitations:

       - It reuses the current LBR facility, so LBR call stack and
         branch record can not be enabled at the same time.

       - It is only available for user-space callchains.

     (Yan, Zheng)

   - x86 Intel Broadwell CPU support and various event constraints and
     event table fixes for earlier models.

     (Andi Kleen)

   - x86 Intel HT CPUs event scheduling workarounds.  This is a complex
     CPU bug affecting the SNB,IVB,HSW families that results in counter
     value corruption.  The mitigation code is automatically enabled and
     is transparent.

     (Maria Dimakopoulou, Stephane Eranian)

  The perf tooling side had a ton of changes in this cycle as well, so
  I'm only able to list the user visible changes here, in addition to
  the tooling changes outlined above:

  User visible changes affecting all tools:

      - Improve support of compressed kernel modules (Jiri Olsa)
      - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
      - Bash completion for subcommands (Yunlong Song)
      - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
      - Support missing -f to override perf.data file ownership. (Yunlong Song)
      - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)

  User visible changes in individual tools:

    'perf data':

        New tool for converting perf.data to other formats, initially
        for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
        Sebastian Siewior)

    'perf diff':

        Add --kallsyms option (David Ahern)

    'perf list':

        Allow listing events with 'tracepoint' prefix (Yunlong Song)

        Sort the output of the command (Yunlong Song)

    'perf kmem':

        Respect -i option (Jiri Olsa)

        Print big numbers using thousands' group (Namhyung Kim)

        Allow -v option (Namhyung Kim)

        Fix alignment of slab result table (Namhyung Kim)

    'perf probe':

        Support multiple probes on different binaries on the same command line (Masami Hiramatsu)

        Support unnamed union/structure members data collection. (Masami Hiramatsu)

        Check kprobes blacklist when adding new events. (Masami Hiramatsu)

    'perf record':

        Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)

        Support recording running/enabled time (Andi Kleen)

    'perf sched':

        Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)

    'perf report' and 'perf top':

        Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)

        Indicate which callchain entries are annotated in the
        TUI hists browser (Arnaldo Carvalho de Melo)

        Add pid/tid filtering to 'report' and 'script' commands (David Ahern)

        Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
        cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
        events (Arnaldo Carvalho de Melo)

    'perf stat':

        Report unsupported events properly (Suzuki K. Poulose)

        Output running time and run/enabled ratio in CSV mode (Andi Kleen)

    'perf trace':

        Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)

        Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)

        Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)

        Dump stack on segfaults (Arnaldo Carvalho de Melo)

        No need to explicitely enable evsels for workload started from perf, let it
        be enabled via perf_event_attr.enable_on_exec, removing some events that take
        place in the 'perf trace' before a workload is really started by it.
        (Arnaldo Carvalho de Melo)

        Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)

  There's also been a ton of infrastructure work done, such as the
  split-out of perf's build system into tools/build/ and other changes -
  see the shortlog and changelog for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
  perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
  perf evlist: Fix type for references to data_head/tail
  perf probe: Check the orphaned -x option
  perf probe: Support multiple probes on different binaries
  perf buildid-list: Fix segfault when show DSOs with hits
  perf tools: Fix cross-endian analysis
  perf tools: Fix error path to do closedir() when synthesizing threads
  perf tools: Fix synthesizing fork_event.ppid for non-main thread
  perf tools: Add 'I' event modifier for exclude_idle bit
  perf report: Don't call map__kmap if map is NULL.
  perf tests: Fix attr tests
  perf probe: Fix ARM 32 building error
  perf tools: Merge all perf_event_attr print functions
  perf record: Add clockid parameter
  perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
  perf sched replay: Support using -f to override perf.data file ownership
  perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
  perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
  perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
  perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
  ...
2015-04-14 14:37:47 -07:00
Linus Torvalds 078838d565 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "The main changes in this cycle were:

   - changes permitting use of call_rcu() and friends very early in
     boot, for example, before rcu_init() is invoked.

   - add in-kernel API to enable and disable expediting of normal RCU
     grace periods.

   - improve RCU's handling of (hotplug-) outgoing CPUs.

   - NO_HZ_FULL_SYSIDLE fixes.

   - tiny-RCU updates to make it more tiny.

   - documentation updates.

   - miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
  cpu: Defer smpboot kthread unparking until CPU known to scheduler
  rcu: Associate quiescent-state reports with grace period
  rcu: Yet another fix for preemption and CPU hotplug
  rcu: Add diagnostics to grace-period cleanup
  rcutorture: Default to grace-period-initialization delays
  rcu: Handle outgoing CPUs on exit from idle loop
  cpu: Make CPU-offline idle-loop transition point more precise
  rcu: Eliminate ->onoff_mutex from rcu_node structure
  rcu: Process offlining and onlining only at grace-period start
  rcu: Move rcu_report_unblock_qs_rnp() to common code
  rcu: Rework preemptible expedited bitmask handling
  rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
  rcutorture: Enable slow grace-period initializations
  rcu: Provide diagnostic option to slow down grace-period initialization
  rcu: Detect stalls caused by failure to propagate up rcu_node tree
  rcu: Eliminate empty HOTPLUG_CPU ifdef
  rcu: Simplify sync_rcu_preempt_exp_init()
  rcu: Put all orphan-callback-related code under same comment
  rcu: Consolidate offline-CPU callback initialization
  ...
2015-04-14 13:36:04 -07:00
Linus Torvalds d0bbe0dd35 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "Usual trivial tree updates.  Nothing outstanding -- mostly printk()
  and comment fixes and unused identifier removals"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  goldfish: goldfish_tty_probe() is not using 'i' any more
  powerpc: Fix comment in smu.h
  qla2xxx: Fix printks in ql_log message
  lib: correct link to the original source for div64_u64
  si2168, tda10071, m88ds3103: Fix firmware wording
  usb: storage: Fix printk in isd200_log_config()
  qla2xxx: Fix printk in qla25xx_setup_mode
  init/main: fix reset_device comment
  ipwireless: missing assignment
  goldfish: remove unreachable line of code
  coredump: Fix do_coredump() comment
  stacktrace.h: remove duplicate declaration task_struct
  smpboot.h: Remove unused function prototype
  treewide: Fix typo in printk messages
  treewide: Fix typo in printk messages
  mod_devicetable: fix comment for match_flags
2015-04-14 09:50:27 -07:00
Paul E. McKenney 00df35f991 cpu: Defer smpboot kthread unparking until CPU known to scheduler
Currently, smpboot_unpark_threads() is invoked before the incoming CPU
has been added to the scheduler's runqueue structures.  This might
potentially cause the unparked kthread to run on the wrong CPU, since the
correct CPU isn't fully set up yet.

That causes a sporadic, hard to debug boot crash triggering on some
systems, reported by Borislav Petkov, and bisected down to:

  2a442c9c64 ("x86: Use common outgoing-CPU-notification code")

This patch places smpboot_unpark_threads() in a CPU hotplug
notifier with priority set so that these kthreads are unparked just after
the CPU has been added to the runqueues.

Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-13 08:25:16 +02:00
Vladimir Davydov 197175427a Documentation/memcg: update memcg/kmem status
Memcg/kmem reclaim support has been finally merged. Reflect this in the
documentation.

Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-04-11 15:23:31 +02:00
Ingo Molnar e1abf2cc8d bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurable
So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
dependency, it also depends on the tracing infrastructure and on kprobes,
without which it will fail to build with:

  In file included from kernel/trace/bpf_trace.c:14:0:
  kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
  kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
    unsigned int val = current->trace_recursion;
  [...]

It took quite some time to trigger this build failure, because right now
BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL
more configurable, not just under CONFIG_EXPERT.

If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
gateway as well.

We might want to make this an interactive option later on, although
I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
an indicator that the user wants BPF support.

Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 16:28:06 +02:00
Frans Klaver fc454fdc10 init/main: fix reset_device comment
Fix incorrect spelling of situation.

Signed-off-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-06 23:12:43 +01:00
Zefan Li 695df2132c cpuset: initialize cpuset a bit early
Now we call ss->bind() in cgroup_init(), so cgroup_init() will
call cpuset_bind() and then the latter will access top_cpuset's
cpumask, which is NULL, because cpuset_init() is called after
cgroup_init()

The simplest fix is to swap cgroup_init() and cpuset_init().

Cc: Vladimir Davydov <vdavydov@parallels.com>
Fixes: 295458e672 ("cgroup: call cgroup_subsys->bind on cgroup subsys initialization")
Reported by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
2015-03-05 08:05:20 -05:00
Paul E. McKenney ee42571f43 rcu: Add Kconfig option to expedite grace periods during boot
This commit adds a CONFIG_RCU_EXPEDITE_BOOT Kconfig parameter
that emulates a very early boot rcu_expedite_gp().  A late-boot
call to rcu_end_inkernel_boot() will provide the corresponding
rcu_unexpedite_gp().  The late-boot call to rcu_end_inkernel_boot()
should be made just before init is spawned.

According to Arjan:

> To show the boot time, I'm using the timestamp of the "Write protecting"
> line, that's pretty much the last thing we print prior to ring 3 execution.
>
> A kernel with default RCU behavior (inside KVM, only virtual devices)
> looks like this:
>
> [    0.038724] Write protecting the kernel read-only data: 10240k
>
> a kernel with expedited RCU (using the command line option, so that I
> don't have to recompile between measurements and thus am completely
> oranges-to-oranges)
>
> [    0.031768] Write protecting the kernel read-only data: 10240k
>
> which, in percentage, is an 18% improvement.

Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Arjan van de Ven <arjan@linux.intel.com>
2015-02-26 12:03:03 -08:00
Linus Torvalds b11a278397 Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig updates from Michal Marek:
 "Yann E Morin was supposed to take over kconfig maintainership, but
  this hasn't happened.  So I'm sending a few kconfig patches that I
  collected:

   - Fix for missing va_end in kconfig
   - merge_config.sh displays used if given too few arguments
   - s/boolean/bool/ in Kconfig files for consistency, with the plan to
     only support bool in the future"

* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kconfig: use va_end to match corresponding va_start
  merge_config.sh: Display usage if given too few arguments
  kconfig: use bool instead of boolean for type definition attributes
2015-02-19 10:36:45 -08:00
Linus Torvalds 7734334337 Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull misc kbuild changes from Michal Marek:
 "Just a few non-critical kbuild changes:

   - builddeb adds the actual distribution name in the changelog
   - documentation fixes"

* 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kbuild: trivial - fix the help doc of CONFIG_CC_OPTIMIZE_FOR_SIZE
  kbuild: Update documentation of clean-files and clean-dirs
  builddeb: Try to determine distribution
  builddeb: Update year and git repository URL in debian/copyright
2015-02-19 10:31:37 -08:00
Andy Lutomirski 5125991c9a init: remove CONFIG_INIT_FALLBACK
CONFIG_INIT_FALLBACK adds config bloat without an obvious use case that
makes it worth keeping around.  Delete it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: Shuah Khan <shuah.kh@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:42 -08:00
Linus Torvalds 9d43bade34 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Ingo Molnar:
 "Continued fallout of the conversion of the x86 IRQ code to the
  hierarchical irqdomain framework: more cleanups, simplifications,
  memory allocation behavior enhancements, mainly in the interrupt
  remapping and APIC code"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  x86, init: Fix UP boot regression on x86_64
  iommu/amd: Fix irq remapping detection logic
  x86/acpi: Make acpi_[un]register_gsi_ioapic() depend on CONFIG_X86_LOCAL_APIC
  x86: Consolidate boot cpu timer setup
  x86/apic: Reuse apic_bsp_setup() for UP APIC setup
  x86/smpboot: Sanitize uniprocessor init
  x86/smpboot: Move apic init code to apic.c
  init: Get rid of x86isms
  x86/apic: Move apic_init_uniprocessor code
  x86/smpboot: Cleanup ioapic handling
  x86/apic: Sanitize ioapic handling
  x86/ioapic: Add proper checks to setp/enable_IO_APIC()
  x86/ioapic: Provide stub functions for IOAPIC%3Dn
  x86/smpboot: Move smpboot inlines to code
  x86/x2apic: Use state information for disable
  x86/x2apic: Split enable and setup function
  x86/x2apic: Disable x2apic from nox2apic setup
  x86/x2apic: Add proper state tracking
  x86/x2apic: Clarify remapping mode for x2apic enablement
  x86/x2apic: Move code in conditional region
  ...
2015-02-09 16:57:56 -08:00
Thomas Gleixner 30b8b0066c init: Get rid of x86isms
The UP local API support can be set up from an early initcall. No need
for horrible hackery in the init code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211703.827943883@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-22 15:10:56 +01:00
Paul E. McKenney 78e691f4ae Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD
doc.2015.01.07a: Documentation updates.
fixes.2015.01.15a: Miscellaneous fixes.
preempt.2015.01.06a: Changes to handling of lists of preempted tasks.
srcu.2015.01.06a: SRCU updates.
stall.2015.01.16a: RCU CPU stall-warning updates and fixes.
torture.2015.01.11a: RCU torture-test updates and fixes.
2015-01-15 23:34:34 -08:00
Paul E. McKenney a94844b22a rcu: Optionally run grace-period kthreads at real-time priority
Recent testing has shown that under heavy load, running RCU's grace-period
kthreads at real-time priority can improve performance (according to 0day
test robot) and reduce the incidence of RCU CPU stall warnings.  However,
most systems do just fine with the default non-realtime priorities for
these kthreads, and it does not make sense to expose the entire user
base to any risk stemming from this change, given that this change is
of use only to a few users running extremely heavy workloads.

Therefore, this commit allows users to specify realtime priorities
for the grace-period kthreads, but leaves them running SCHED_OTHER
by default.  The realtime priority may be specified at build time
via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
rcutree.kthread_prio parameter.  Either way, 0 says to continue the
default SCHED_OTHER behavior and values from 1-99 specify that priority
of SCHED_FIFO behavior.  Note that a value of 0 is not permitted when
the RCU_BOOST Kconfig parameter is specified.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15 23:25:04 -08:00
Masahiro Yamada 31a4af7f7d kbuild: trivial - fix the help doc of CONFIG_CC_OPTIMIZE_FOR_SIZE
Other than GCC, we have another choice, Clang for building the kernel
these days.  It seems better to say "compiler" rather than "gcc".

Signed-off-by: Masahiro Yamada <yamada.m@jp.panasonic.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2015-01-08 14:53:50 +01:00
Christoph Jaeger 6341e62b21 kconfig: use bool instead of boolean for type definition attributes
Support for keyword 'boolean' will be dropped later on.

No functional change.

Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2015-01-07 13:08:04 +01:00
Pranith Kumar 83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
Lai Jiangshan 5a43b88e98 rcu: Remove "select IRQ_WORK" from config TREE_RCU
The 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
removed the irq_work_queue(), so the TREE_RCU doesn't need
irq work any more.  This commit therefore updates RCU's Kconfig and

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:17 -08:00
Miklos Szeredi 10975933da init: fix read-write root mount
If mount flags don't have MS_RDONLY, iso9660 returns EACCES without actually
checking if it's an iso image.

This tricks mount_block_root() into retrying with MS_RDONLY.  This results
in a read-only root despite the "rw" boot parameter if the actual
filesystem was checked after iso9660.

I believe the behavior of iso9660 is okay, while that of mount_block_root()
is not.  It should rather try all types without MS_RDONLY and only then
retry with MS_RDONLY.

This change also makes the code more robust against the case when EACCES is
returned despite MS_RDONLY, which would've resulted in a lockup.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 08:27:14 -05:00
Linus Torvalds 603ba7e41b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #2 from Al Viro:
 "Next pile (and there'll be one or two more).

  The large piece in this one is getting rid of /proc/*/ns/* weirdness;
  among other things, it allows to (finally) make nameidata completely
  opaque outside of fs/namei.c, making for easier further cleanups in
  there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_venus_readdir(): use file_inode()
  fs/namei.c: fold link_path_walk() call into path_init()
  path_init(): don't bother with LOOKUP_PARENT in argument
  fs/namei.c: new helper (path_cleanup())
  path_init(): store the "base" pointer to file in nameidata itself
  make default ->i_fop have ->open() fail with ENXIO
  make nameidata completely opaque outside of fs/namei.c
  kill proc_ns completely
  take the targets of /proc/*/ns/* symlinks to separate fs
  bury struct proc_ns in fs/proc
  copy address of proc_ns_ops into ns_common
  new helpers: ns_alloc_inum/ns_free_inum
  make proc_ns_operations work with struct ns_common * instead of void *
  switch the rest of proc_ns_operations to working with &...->ns
  netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
  make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
  common object embedded into various struct ....ns
2014-12-16 15:53:03 -08:00
Linus Torvalds a7c180aa7e As the merge window is still open, and this code was not as complex
as I thought it might be. I'm pushing this in now.
 
 This will allow Thomas to debug his irq work for 3.20.
 
 This adds two new features:
 
 1) Allow traceopoints to be enabled right after mm_init(). By passing
 in the trace_event= kernel command line parameter, tracepoints can be
 enabled at boot up. For debugging things like the initialization of
 interrupts, it is needed to have tracepoints enabled very early. People
 have asked about this before and this has been on my todo list. As it
 can be helpful for Thomas to debug his upcoming 3.20 IRQ work, I'm
 pushing this now. This way he can add tracepoints into the IRQ set up
 and have users enable them when things go wrong.
 
 2) Have the tracepoints printed via printk() (the console) when they
 are triggered. If the irq code locks up or reboots the box, having the
 tracepoint output go into the kernel ring buffer is useless for
 debugging. But being able to add the tp_printk kernel command line
 option along with the trace_event= option will have these tracepoints
 printed as they occur, and that can be really useful for debugging
 early lock up or reboot problems.
 
 This code is not that intrusive and it passed all my tests. Thomas tried
 them out too and it works for his needs.
 
  Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUjv3kAAoJEEjnJuOKh9ldLNsIANAe5EmDCBw0WjR72n+G3qOH
 NC8calXfkjqHU0bv8Q3dRv20KH4MHOy6l4+EiV9/ovt71LOF3NEyUJ3HuShf9a8b
 sWcUhYbX3D1hViQe5sOzv9AWhBCFlKQGoNmQnydX9xa8ivRsBaTGJIGktWlHcwBE
 jF1i3fj3l3vRQSS8qZFXp3bzreunlGyPoSHcT6eWQeos+utj4sKwQWTLXTLQeM+6
 oQtFKRx7E5yX04qO1qFczS8qIEC6JH2C2jIRYEKUGepaELlnGkb8O7jQV/RaLF4/
 6P8VhZFG9YLS7fn7vWu0SnAN+Zwz5LzgjXAZt0FhGtIhLc18Oj8ouHH1UORsdQM=
 =Z4Un
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "As the merge window is still open, and this code was not as complex as
  I thought it might be.  I'm pushing this in now.

  This will allow Thomas to debug his irq work for 3.20.

  This adds two new features:

  1) Allow traceopoints to be enabled right after mm_init().

     By passing in the trace_event= kernel command line parameter,
     tracepoints can be enabled at boot up.  For debugging things like
     the initialization of interrupts, it is needed to have tracepoints
     enabled very early.  People have asked about this before and this
     has been on my todo list.  As it can be helpful for Thomas to debug
     his upcoming 3.20 IRQ work, I'm pushing this now.  This way he can
     add tracepoints into the IRQ set up and have users enable them when
     things go wrong.

  2) Have the tracepoints printed via printk() (the console) when they
     are triggered.

     If the irq code locks up or reboots the box, having the tracepoint
     output go into the kernel ring buffer is useless for debugging.
     But being able to add the tp_printk kernel command line option
     along with the trace_event= option will have these tracepoints
     printed as they occur, and that can be really useful for debugging
     early lock up or reboot problems.

  This code is not that intrusive and it passed all my tests.  Thomas
  tried them out too and it works for his needs.

   Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"

* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Add tp_printk cmdline to have tracepoints go to printk()
  tracing: Move enabling tracepoints to just after rcu_init()
2014-12-16 12:53:59 -08:00
Steven Rostedt (Red Hat) 5f893b2639 tracing: Move enabling tracepoints to just after rcu_init()
Enabling tracepoints at boot up can be very useful. The tracepoint
can be initialized right after RCU has been. There's no need to
wait for the early_initcall() to be called. That's too late for some
things that can use tracepoints for debugging. Move the logic to
enable tracepoints out of the initcalls and into init/main.c to
right after rcu_init().

This also allows trace_printk() to be used early too.

Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.org

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-15 10:16:50 -05:00
Linus Torvalds 67e2c38838 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "In terms of changes, there's general maintenance to the Smack,
  SELinux, and integrity code.

  The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT,
  which allows IMA appraisal to require signatures.  Support for reading
  keys from rootfs before init is call is also added"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits)
  selinux: Remove security_ops extern
  security: smack: fix out-of-bounds access in smk_parse_smack()
  VFS: refactor vfs_read()
  ima: require signature based appraisal
  integrity: provide a hook to load keys when rootfs is ready
  ima: load x509 certificate from the kernel
  integrity: provide a function to load x509 certificate from the kernel
  integrity: define a new function integrity_read_file()
  Security: smack: replace kzalloc with kmem_cache for inode_smack
  Smack: Lock mode for the floor and hat labels
  ima: added support for new kernel cmdline parameter ima_template_fmt
  ima: allocate field pointers array on demand in template_desc_init_fields()
  ima: don't allocate a copy of template_fmt in template_desc_init_fields()
  ima: display template format in meas. list if template name length is zero
  ima: added error messages to template-related functions
  ima: use atomic bit operations to protect policy update interface
  ima: ignore empty and with whitespaces policy lines
  ima: no need to allocate entry for comment
  ima: report policy load status
  ima: use path names cache
  ...
2014-12-14 20:36:37 -08:00
Joonsoo Kim eefa864b70 mm/page_ext: resurrect struct page extending code for debugging
When we debug something, we'd like to insert some information to every
page.  For this purpose, we sometimes modify struct page itself.  But,
this has drawbacks.  First, it requires re-compile.  This makes us
hesitate to use the powerful debug feature so development process is
slowed down.  And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency.  At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel.  Keeping this as it is would be better to reproduce errornous
situation.

This feature is intended to overcome above mentioned problems.  This
feature allocates memory for extended data per page in certain place
rather than the struct page itself.  This memory can be accessed by the
accessor functions provided by this code.  During the boot process, it
checks whether allocation of huge chunk of memory is needed or not.  If
not, it avoids allocating memory at all.  With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.

Until now, memcg uses this technique.  But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed.  I'd like to use this code to develop debug feature, so
this patch resurrect it.

To help these things to work well, this patch introduces two callbacks for
clients.  One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time.  The other is optional, init
callback, which is used to do proper initialization after memory is
allocated.  Detailed explanation about purpose of these functions is in
code comment.  Please refer it.

Others are completely same with previous extension code in memcg.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Al Viro 707c5960f1 Merge branch 'nsfs' into for-next 2014-12-10 21:31:59 -05:00
Al Viro e149ed2b80 take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs.  Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.).  Files on it *are* bindable - we explicitly permit that in do_loopback().

This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot.  The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).

Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present.  See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.

As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10 21:30:20 -05:00
Andy Lutomirski 6ef4536e2f init: allow CONFIG_INIT_FALLBACK=n to disable defaults if init= fails
If a user puts init=/whatever on the command line and /whatever can't be
run, then the kernel will try a few default options before giving up.  If
init=/whatever came from a bootloader prompt, then this is unexpected but
probably harmless.  On the other hand, if it comes from a script (e.g.  a
tool like virtme or perhaps a future kselftest script), then the fallbacks
are likely to exist, but they'll do the wrong thing.  For example, they
might unexpectedly invoke systemd.

This adds a config option CONFIG_INIT_FALLBACK.  If unset, then a failure
to run the specified init= process be fatal.

The tentative plan is to remove CONFIG_INIT_FALLBACK for 3.20.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Rob Landley <rob@landley.net>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:12 -08:00
Johannes Weiner 9edad6ea0f mm: move page->mem_cgroup bad page handling into generic code
Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Johannes Weiner 1306a85aed mm: embed the memcg pointer directly into struct page
Memory cgroups used to have 5 per-page pointers.  To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.

There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged.  The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory.  With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space.  Remaining users that care can
still compile their kernels without CONFIG_MEMCG.

     text    data     bss     dec     hex     filename
  8828345 1725264  983040 11536649 b00909  vmlinux.old
  8827425 1725264  966656 11519345 afc571  vmlinux.new

[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Aneesh Kumar K.V 6f7c97e80b mm/numa balancing: rearrange Kconfig entry
Add the default enable config option after the NUMA_BALANCING option so
that it appears related in the nconfig interface.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Johannes Weiner 5b1efc027c kernel: res_counter: remove the unused API
All memory accounting and limiting has been switched over to the
lockless page counters.  Bye, res_counter!

[akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
[mhocko@suse.cz: ditch the last remainings of res_counter]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Johannes Weiner 71f87bee38 mm: hugetlb_cgroup: convert to lockless page counters
Abandon the spinlock-protected byte counters in favor of the unlocked
page counters in the hugetlb controller as well.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Johannes Weiner 3e32cb2e0a mm: memcontrol: lockless page counters
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page.  The
counter interface is also convoluted and does too many things.

Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it.  The translation from and to bytes then only
happens when interfacing with userspace.

The removed locking overhead is noticable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:

vanilla:

   18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
         1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
            24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
     1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
 8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
 1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
     1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )

     132.474343877 seconds time elapsed                                          ( +-  0.21% )

lockless:

   12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
           832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
            15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
     1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
 9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
 2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
     1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )

      91.369330729 seconds time elapsed                                          ( +-  0.45% )

On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.

Notable differences between the old and new API:

- res_counter_charge() and res_counter_charge_nofail() become
  page_counter_try_charge() and page_counter_charge() resp. to match
  the more common kernel naming scheme of try_do()/do()

- res_counter_uncharge_until() is only ever used to cancel a local
  counter and never to uncharge bigger segments of a hierarchy, so
  it's replaced by the simpler page_counter_cancel()

- res_counter_set_limit() is replaced by page_counter_limit(), which
  expects its callers to serialize against themselves

- res_counter_memparse_write_strategy() is replaced by
  page_counter_limit(), which rounds down to the nearest page size -
  rather than up.  This is more reasonable for explicitely requested
  hard upper limits.

- to keep charging light-weight, page_counter_try_charge() charges
  speculatively, only to roll back if the result exceeds the limit.
  Because of this, a failing bigger charge can temporarily lock out
  smaller charges that would otherwise succeed.  The error is bounded
  to the difference between the smallest and the biggest possible
  charge size, so for memcg, this means that a failing THP charge can
  send base page charges into reclaim upto 2MB (4MB) before the limit
  would have been reached.  This should be acceptable.

[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Al Viro 33c429405a copy address of proc_ns_ops into ns_common
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:47 -05:00
Al Viro 435d5f4bb2 common object embedded into various struct ....ns
for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:31:00 -05:00
Ingo Molnar d360b78f99 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Streamline RCU's use of per-CPU variables, shifting from "cpu"
   arguments to functions to "this_"-style per-CPU variable accessors.

 - Signal-handling RCU updates.

 - Real-time updates.

 - Torture-test updates.

 - Miscellaneous fixes.

 - Documentation updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-20 08:57:58 +01:00