Commit Graph

1291 Commits

Author SHA1 Message Date
Davidlohr Bueso 615d6e8756 mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
Linus Torvalds 24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
 Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
 ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
 kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
 bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
 rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
 ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
 +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
 DrPKlXwIgva7zq5/S0Vr
 =4s1Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: doucple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystm() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Johannes Weiner 91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Linus Torvalds b9f2b21a32 Devicetree changes for v3.15
Updates to devicetree core code. This branch contains the following notable changes:
 * Add reserved memory binding
 * Make struct device_node a kobject and remove legacy /proc/device-tree
 * ePAPR conformance fixes
 * Update in-kernel DTC copy to version v1.4.0
 * Preparation changes for dynamic device tree overlays
 * minor bug fixes and documentation changes
 
 The most significant change in this branch is the conversion of struct
 device_node to be a kobject that is exposed via sysfs and removal of the
 old /proc/device-tree code. This simplifies the device tree handling
 code and tightens up the lifecycle on device tree nodes.
 
 [updated: added fix for dangling select PROC_DEVICETREE]
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOyNwAAoJEMWQL496c2LNZY0QAIreUrpo3/hKRau61EDPXkOA
 UFRyPUHD0k/dNXWWDbTfvKH/nAfzdVwejhePqEWiODiFOFkq7JyQlMKPA+CZuZj0
 ygN4215A1yj/hDf6JRD5Zn4WGpawDt9InlbZSps6P5dd8voV5t5dz6uzz+Y7uqaK
 CAjTDlBSmxEen5vRHiHQgKv74au/+b9yfSURjPQVWg46+wl3WJwjsdzerphm4unW
 tpEr8zkIsm51mqqAx4penIuiovh7+L2J5v4BFeg8o+kaZEuZpVxLHJPOuBd5hdom
 zeqEIj3AqHTh5suYIHe4aAbZ2wMP3kYGgkPGwfWLnwLyULxalcCtGZeaCi9nwTFj
 Fdj+7f17ocrt5mif0f5Deufi1LqJsDjhY6G9p7HuV7Y9hsMILpJIUoGENPji+TWj
 BA4L45eaPmNYdKJytEtFD7F2WnXeHZ6fDtYho/39DWW+Bt16IFX85T199irhxGG4
 byN6LRaahk2UeycSXkQHAlWOQHqzBcJJAkQLN2iahzyYRr9Dy+VI2E9clm53m49O
 YQYcONdUlMYrtfRwJpbB9XHM0HgZUvg0LT5z/iHQs9uJtoo33Oj+zxFixyZLQ9Dq
 qyLqQWEpV9gFLAo9tpf56gffkLiJRsHkX4UJ6oTtj4DY1WWU9H81jjCvv/7flzp/
 8ZyyZzANQf1DZ9kqO2v+
 =lyA5
 -----END PGP SIGNATURE-----

Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux

Pull devicetree changes from Grant Likely:
 "Updates to devicetree core code.  This branch contains the following
  notable changes:

   - add reserved memory binding
   - make struct device_node a kobject and remove legacy
     /proc/device-tree
   - ePAPR conformance fixes
   - update in-kernel DTC copy to version v1.4.0
   - preparatory changes for dynamic device tree overlays
   - minor bug fixes and documentation changes

  The most significant change in this branch is the conversion of struct
  device_node to be a kobject that is exposed via sysfs and removal of
  the old /proc/device-tree code.  This simplifies the device tree
  handling code and tightens up the lifecycle on device tree nodes.

  [updated: added fix for dangling select PROC_DEVICETREE]"

* tag 'dt-for-linus' of git://git.secretlab.ca/git/linux: (29 commits)
  dt: Remove dangling "select PROC_DEVICETREE"
  of: Add support for ePAPR "stdout-path" property
  of: device_node kobject lifecycle fixes
  of: only scan for reserved mem when fdt present
  powerpc: add support for reserved memory defined by device tree
  arm64: add support for reserved memory defined by device tree
  of: add missing major vendors
  of: add vendor prefix for SMSC
  of: remove /proc/device-tree
  of/selftest: Add self tests for manipulation of properties
  of: Make device nodes kobjects so they show up in sysfs
  arm: add support for reserved memory defined by device tree
  drivers: of: add support for custom reserved memory drivers
  drivers: of: add initialization code for dynamic reserved memory
  drivers: of: add initialization code for static reserved memory
  of: document bindings for reserved-memory nodes
  Revert "of: fix of_update_property()"
  kbuild: dtbs_install: new make target
  ARM: mvebu: Allows to get the SoC ID even without PCI enabled
  of: Allows to use the PCI translator without the PCI core
  ...
2014-04-02 14:27:15 -07:00
Al Viro 5d826c847b new helper: readlink_copy()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:15 -04:00
Linus Torvalds a21e40877a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "The main purpose is to fix a full dynticks bug related to
  virtualization, where steal time accounting appears to be zero in
  /proc/stat even after a few seconds of competing guests running busy
  loops in a same host CPU.  It's not a regression though as it was
  there since the beginning.

  The other commits are preparatory work to fix the bug and various
  cleanups"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arch: Remove stub cputime.h headers
  sched: Remove needless round trip nsecs <-> tick conversion of steal time
  cputime: Fix jiffies based cputime assumption on steal accounting
  cputime: Bring cputime -> nsecs conversion
  cputime: Default implementation of nsecs -> cputime conversion
  cputime: Fix nsecs_to_cputime() return type cast
2014-04-01 10:16:10 -07:00
Grant Likely d88cf7d7b4 Merge remote-tracking branch 'robh/for-next' into devicetree/next 2014-03-31 08:10:55 +01:00
William Roberts 21a6457a79 proc: Update get proc_pid_cmdline() to use mm.h helpers
Re-factor proc_pid_cmdline() to use get_cmdline() helper
from mm.h.

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Richard Guy Briggs <rgb@redhat.com>

Signed-off-by: William Roberts <wroberts@tresys.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2014-03-20 10:10:26 -04:00
Frederic Weisbecker bfc3f0281e cputime: Default implementation of nsecs -> cputime conversion
The architectures that override cputime_t (s390, ppc) don't provide
any version of nsecs_to_cputime(). Indeed this cputime_t implementation
by backend only happens when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y under
which the core code doesn't make any use of nsecs_to_cputime().

At least for now.

We are going to make a broader use of it so lets provide a default
version with a per usecs granularity. It should be good enough for most
usecases.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-03-13 15:56:43 +01:00
Theodore Ts'o 02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Grant Likely 8357041a69 of: remove /proc/device-tree
The same data is now available in sysfs, so we can remove the code
that exports it in /proc and replace it with a symlink to the sysfs
version.

Tested on versatile qemu model and mpc5200 eval board. More testing
would be appreciated.

v5: Fixed up conflicts with mainline changes

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Pantelis Antoniou <panto@antoniou-consulting.com>
2014-03-11 20:48:32 +00:00
Artem Fetishev 70335abb26 fs/proc/base.c: fix GPF in /proc/$PID/map_files
The expected logic of proc_map_files_get_link() is either to return 0
and initialize 'path' or return an error and leave 'path' uninitialized.

By the time dname_to_vma_addr() returns 0 the corresponding vma may have
already be gone.  In this case the path is not initialized but the
return value is still 0.  This results in 'general protection fault'
inside d_path().

Steps to reproduce:

  CONFIG_CHECKPOINT_RESTORE=y

    fd = open(...);
    while (1) {
        mmap(fd, ...);
        munmap(fd, ...);
    }

  ls -la /proc/$PID/map_files

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991

Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
Signed-off-by: Aleksandr Terekhov <aleksandr_terekhov@epam.com>
Reported-by: <wiebittewas@gmail.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:20 -07:00
David Rientjes 668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Greg Pearson 38dfac843c vmcore: prevent PT_NOTE p_memsz overflow during header update
Currently, update_note_header_size_elf64() and
update_note_header_size_elf32() will add the size of a PT_NOTE entry to
real_sz even if that causes real_sz to exceeds max_sz.  This patch
corrects the while loop logic in those routines to ensure that does not
happen and prints a warning if a PT_NOTE entry is dropped.  If zero
PT_NOTE entries are found or this condition is encountered because the
only entry was dropped, a warning is printed and an error is returned.

One possible negative side effect of exceeding the max_sz limit is an
allocation failure in merge_note_headers_elf64() or
merge_note_headers_elf32() which would produce console output such as
the following while booting the crash kernel.

  vmalloc: allocation failure: 14076997632 bytes
  swapper/0: page allocation failure: order:0, mode:0x80d2
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7
  Call Trace:
    dump_stack+0x19/0x1b
    warn_alloc_failed+0xf0/0x160
    __vmalloc_node_range+0x19e/0x250
    vmalloc_user+0x4c/0x70
    merge_note_headers_elf64.constprop.9+0x116/0x24a
    vmcore_init+0x2d4/0x76c
    do_one_initcall+0xe2/0x190
    kernel_init_freeable+0x17c/0x207
    kernel_init+0xe/0x180
    ret_from_fork+0x7c/0xb0

  Kdump: vmcore not initialized

  kdump: dump target is /dev/sda4
  kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/
  kdump: saving vmcore-dmesg.txt
  Cannot open /proc/vmcore: No such file or directory
  kdump: saving vmcore-dmesg.txt failed
  kdump: saving vmcore
  kdump: saving vmcore failed

This type of failure has been seen on a four socket prototype system
with certain memory configurations.  Most PT_NOTE sections have a single
entry similar to:

  n_namesz = 0x5
  n_descsz = 0x150
  n_type   = 0x1

Occasionally, a second entry is encountered with very large n_namesz and
n_descsz sizes:

  n_namesz = 0x80000008
  n_descsz = 0x510ae163
  n_type   = 0x80000008

Not yet sure of the source of these extra entries, they seem bogus, but
they shouldn't cause crash dump to fail.

Signed-off-by: Greg Pearson <greg.pearson@hp.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10 16:01:40 -08:00
Oleg Nesterov 185ee40ee7 fs/proc/array.c: change do_task_stat() to use while_each_thread()
Change the remaining next_thread (ab)users to use while_each_thread().

The last user which should be changed is next_tid(), but we can't do this
now.

__exit_signal() and complete_signal() are fine, they actually need
next_thread() logic.

This patch (of 3):

do_task_stat() can use while_each_thread(), no changes in
the compiled code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:02 -08:00
Paul Gortmaker abaf3787ac fs/proc: don't use module_init for non-modular core code
PROC_FS is a bool, so this code is either present or absent.  It will
never be modular, so using module_init as an alias for __initcall is
rather misleading.

Fix this up now, so that we can relocate module_init from init.h into
module.h in the future.  If we don't do this, we'd have to add module.h to
obviously non-modular code, and that would be ugly at best.

Note that direct use of __initcall is discouraged, vs.  one of the
priority categorized subgroups.  As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change these registrations from level 6-device to level 5-fs
(i.e.  slightly earlier).  However no observable impact of that small
difference has been observed during testing, or is expected.

Also note that this change uncovers a missing semicolon bug in the
registration of vmcore_init as an initcall.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:02 -08:00
Dave Jones c1d867a54d fs/proc/proc_devtree.c: remove empty /proc/device-tree when no openfirmware exists.
Distribution kernels might want to build in support for /proc/device-tree
for kernels that might end up running on hardware that doesn't support
openfirmware.  This results in an empty /proc/device-tree existing.
Remove it if the OFW root node doesn't exist.

This situation actually confuses grub2, resulting in install failures.
grub2 sees the /proc/device-tree and picks the wrong install target cf.
http://bzr.savannah.gnu.org/lh/grub/trunk/grub/annotate/4300/util/grub-install.in#L311
grub should be more robust, but still, leaving an empty proc dir seems
pointless.

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=818378.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Rui Xiang cdf7e8dded proc: set attributes of pde using accessor functions
Use existing accessors proc_set_user() and proc_set_size() to set
attributes.  Just a cleanup.

Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Oleg Nesterov 9f6e963f06 proc: fix ->f_pos overflows in first_tid()
1. proc_task_readdir()->first_tid() path truncates f_pos to int, this
   is wrong even on 64bit.

   We could check that f_pos < PID_MAX or even INT_MAX in
   proc_task_readdir(), but this patch simply checks the potential
   overflow in first_tid(), this check is nop on 64bit.  We do not care if
   it was negative and the new unsigned value is huge, all we need to
   ensure is that we never wrongly return !NULL.

2. Remove the 2nd "nr != 0" check before get_nr_threads(),
   nr_threads == 0 is not distinguishable from !pid_task() above.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Oleg Nesterov d855a4b79f proc: don't (ab)use ->group_leader in proc_task_readdir() paths
proc_task_readdir() does not really need "leader", first_tid() has to
revalidate it anyway.  Just pass proc_pid(inode) to first_tid() instead,
it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if
necessary.

The patch also extracts the "inode is dead" code from
pid_delete_dentry(dentry) into the new trivial helper,
proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT
if this dir was removed.

This is a bit racy, but the race is very inlikely and the getdents() after
openndir() can see the empty "." + ".." dir only once.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Oleg Nesterov c986c14a6a proc: change first_tid() to use while_each_thread() rather than next_thread()
Rerwrite the main loop to use while_each_thread() instead of
next_thread().  We are going to fix or replace while_each_thread(),
next_thread() should be avoided whenever possible.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Oleg Nesterov 940fe4793a proc: fix the potential use-after-free in first_tid()
proc_task_readdir() verifies that the result of get_proc_task() is
pid_alive() and thus its ->group_leader is fine too.  However this is not
necessarily true after rcu_read_unlock(), we need to recheck this again
after first_tid() does rcu_read_lock().  Otherwise
leader->thread_group.next (used by next_thread()) can be invalid if the
rcu grace period expires in between.

The race is subtle and unlikely, but still it is possible afaics.  To
simplify lets ignore the "likely" case when tid != 0, f_version can be
cleared by proc_task_operations->llseek().

Suppose we have a main thread M and its subthread T.  Suppose that f_pos
== 3, iow first_tid() should return T.  Now suppose that the following
happens between rcu_read_unlock() and rcu_read_lock():

	1. T execs and becomes the new leader. This removes M from
	    ->thread_group but next_thread(M) is still T.

	2. T creates another thread X which does exec as well, T
	   goes away.

	3. X creates another subthread, this increments nr_threads.

	4. first_tid() does next_thread(M) and returns the already
	   dead T.

Note also that we need 2.  and 3.  only because of get_nr_threads() check,
and this check was supposed to be optimization only.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Oleg Nesterov 74e37200de proc: cleanup/simplify get_task_state/task_state_array
get_task_state() and task_state_array[] look confusing and suboptimal, it
is not clear what it can actually report to user-space and
task_state_array[] blows .data for no reason.

1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
   clear. TASK_REPORT is self-documenting but it is not clear
   what ->exit_state can add.

   Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
   into TASK_REPORT and use it to calculate the final result.

2. With the change above it is obvious that task_state_array[]
   has the unused entries just to make BUILD_BUG_ON() happy.

   Change this BUILD_BUG_ON() to use TASK_REPORT rather than
   TASK_STATE_MAX and shrink task_state_array[].

3. Turn the "while (state)" loop into fls(state).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Naoya Horiguchi e3bba3c3c9 fs/proc/page.c: add PageAnon check to surely detect thp
stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
know that a specified page is thp or not.  But sometimes it's not enough
and we fail to detect thp when the thp is on pagevec.  This happens only
for a few seconds after LRU list operations, but it makes it difficult
to control our applications depending on this flag.

So this patch adds another check PageAnon to detect thps on pagevec.  It
might not give the future extensibility for thp pagecache, but it's OK
at least for now.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:50 -08:00
Rik van Riel 34e431b0ae /proc/meminfo: provide estimated available memory
Many load balancing and workload placing programs check /proc/meminfo to
estimate how much free memory is available.  They generally do this by
adding up "free" and "cached", which was fine ten years ago, but is
pretty much guaranteed to be wrong today.

It is wrong because Cached includes memory that is not freeable as page
cache, for example shared memory segments, tmpfs, and ramfs, and it does
not include reclaimable slab memory, which can take up a large fraction
of system memory on mostly idle systems with lots of files.

Currently, the amount of memory that is available for a new workload,
without pushing the system into swap, can be estimated from MemFree,
Active(file), Inactive(file), and SReclaimable, as well as the "low"
watermarks from /proc/zoneinfo.

However, this may change in the future, and user space really should not
be expected to know kernel internals to come up with an estimate for the
amount of free memory.

It is more convenient to provide such an estimate in /proc/meminfo.  If
things change in the future, we only have to change it in one place.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Erik Mouw <erik.mouw_2@nxp.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Jan Beulich ae5758a1a7 procfs: also fix proc_reg_get_unmapped_area() for !MMU case
Commit fad1a86e25 ("procfs: call default get_unmapped_area on
MMU-present architectures"), as its title says, took care of only the
MMU case, leaving the !MMU side still in the regressed state (returning
-EIO in all cases where pde->proc_fops->get_unmapped_area is NULL).

From the fad1a86e25 changelog:

 "Commit c4fe244857 ("sparc: fix PCI device proc file mmap(2)") added
  proc_reg_get_unmapped_area in proc_reg_file_ops and
  proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
  get_unmapped_area method is not defined for the target procfs file, which
  causes regression of mmap on /proc/vmcore.

  To address this issue, like get_unmapped_area(), call default
  current->mm->get_unmapped_area on MMU-present architectures if
  pde->proc_fops->get_unmapped_area, i.e.  the one in actual file operation
  in the procfs file, is not defined"

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>	[3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 18:19:26 -08:00
Linus Torvalds 3eaded86ac Merge git://git.infradead.org/users/eparis/audit
Pull audit updates from Eric Paris:
 "Nothing amazing.  Formatting, small bug fixes, couple of fixes where
  we didn't get records due to some old VFS changes, and a change to how
  we collect execve info..."

Fixed conflict in fs/exec.c as per Eric and linux-next.

* git://git.infradead.org/users/eparis/audit: (28 commits)
  audit: fix type of sessionid in audit_set_loginuid()
  audit: call audit_bprm() only once to add AUDIT_EXECVE information
  audit: move audit_aux_data_execve contents into audit_context union
  audit: remove unused envc member of audit_aux_data_execve
  audit: Kill the unused struct audit_aux_data_capset
  audit: do not reject all AUDIT_INODE filter types
  audit: suppress stock memalloc failure warnings since already managed
  audit: log the audit_names record type
  audit: add child record before the create to handle case where create fails
  audit: use given values in tty_audit enable api
  audit: use nlmsg_len() to get message payload length
  audit: use memset instead of trying to initialize field by field
  audit: fix info leak in AUDIT_GET requests
  audit: update AUDIT_INODE filter rule to comparator function
  audit: audit feature to set loginuid immutable
  audit: audit feature to only allow unsetting the loginuid
  audit: allow unsetting the loginuid (with priv)
  audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
  audit: loginuid functions coding style
  selinux: apply selinux checks on new audit message types
  ...
2013-11-21 19:18:14 -08:00
Al Viro b26d4cd385 consolidate simple ->d_delete() instances
Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-15 22:04:17 -05:00
Tetsuo Handa 652586df95 seq_file: remove "%n" usage from seq_file users
All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov cb900f4121 mm, hugetlb: convert hugetlbfs to use split pmd lock
Hugetlb supports multiple page sizes. We use split lock only for PMD
level, but not for PUD.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov bf929152e9 mm, thp: change pmd_trans_huge_lock() to return taken lock
With split ptlock it's important to know which lock
pmd_trans_huge_lock() took.  This patch adds one more parameter to the
function to return the lock.

In most places migration to new api is trivial.  Exception is
move_huge_pmd(): we need to take two locks if pmd tables are different.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov e1f56c89b0 mm: convert mm->nr_ptes to atomic_long_t
With split page table lock for PMD level we can't hold mm->page_table_lock
while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Linus Torvalds 5cbb3d216e Merge branch 'akpm' (patches from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 "Quite a lot of other stuff is banked up awaiting further
  next->mainline merging, but this batch contains:

   - Lots of random misc patches
   - OCFS2
   - Most of MM
   - backlight updates
   - lib/ updates
   - printk updates
   - checkpatch updates
   - epoll tweaking
   - rtc updates
   - hfs
   - hfsplus
   - documentation
   - procfs
   - update gcov to gcc-4.7 format
   - IPC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits)
  ipc, msg: fix message length check for negative values
  ipc/util.c: remove unnecessary work pending test
  devpts: plug the memory leak in kill_sb
  ./Makefile: export initial ramdisk compression config option
  init/Kconfig: add option to disable kernel compression
  drivers: w1: make w1_slave::flags long to avoid memory corruption
  drivers/w1/masters/ds1wm.cuse dev_get_platdata()
  drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
  drivers/memstick/core/mspro_block.c: fix attributes array allocation
  drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
  kernel/panic.c: reduce 1 byte usage for print tainted buffer
  gcov: reuse kbasename helper
  kernel/gcov/fs.c: use pr_warn()
  kernel/module.c: use pr_foo()
  gcov: compile specific gcov implementation based on gcc version
  gcov: add support for gcc 4.7 gcov format
  gcov: move gcov structs definitions to a gcc version specific file
  kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
  kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
  kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
  ...
2013-11-13 15:45:43 +09:00
Linus Torvalds 9bc9ccd7db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "All kinds of stuff this time around; some more notable parts:

   - RCU'd vfsmounts handling
   - new primitives for coredump handling
   - files_lock is gone
   - Bruce's delegations handling series
   - exportfs fixes

  plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
  ecryptfs: ->f_op is never NULL
  locks: break delegations on any attribute modification
  locks: break delegations on link
  locks: break delegations on rename
  locks: helper functions for delegation breaking
  locks: break delegations on unlink
  namei: minor vfs_unlink cleanup
  locks: implement delegations
  locks: introduce new FL_DELEG lock flag
  vfs: take i_mutex on renamed file
  vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
  vfs: don't use PARENT/CHILD lock classes for non-directories
  vfs: pull ext4's double-i_mutex-locking into common code
  exportfs: fix quadratic behavior in filehandle lookup
  exportfs: better variable name
  exportfs: move most of reconnect_path to helper function
  exportfs: eliminate unused "noprogress" counter
  exportfs: stop retrying once we race with rename/remove
  exportfs: clear DISCONNECTED on all parents sooner
  exportfs: more detailed comment for path_reconnect
  ...
2013-11-13 15:34:18 +09:00
Randy Dunlap 1c3fc3e5cc kcore: add Kconfig help text
Under Pseudo filesystems, /proc/kcore support has no help.

Fixes a portion of kernel bugzilla #52671:
  https://bugzilla.kernel.org/show_bug.cgi?id=52671

Thanks for David Howells for the help text.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: <lailavrazda1979@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:33 +09:00
HATAYAMA Daisuke 5721cf84d9 procfs: clean up proc_reg_get_unmapped_area for 80-column limit
Clean up proc_reg_get_unmapped_area due to its 80-column limit
violation.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:33 +09:00
Jerome Marchand 00619bcc44 mm: factor commit limit calculation
The same calculation is currently done in three differents places.
Factor that code so future changes has to be made at only one place.

[akpm@linux-foundation.org: uninline vm_commit_limit()]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Naoya Horiguchi ec8e41aec1 /proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line
This flag shows that the VMA is "newly created" and thus represents
"dirty" in the task's VM.

You can clear it by "echo 4 > /proc/pid/clear_refs."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
David Rientjes 948927ee9e mm, mempolicy: make mpol_to_str robust and always succeed
mpol_to_str() should not fail.  Currently, it either fails because the
string buffer is too small or because a string hasn't been defined for a
mempolicy mode.

If a new mempolicy mode is introduced and no string is defined for it,
just warn and return "unknown".

If the buffer is too small, just truncate the string and return, the
same behavior as snprintf().

This also fixes a bug where there was no NULL-byte termination when doing
*p++ = '=' and *p++ ':' and maxlen has been reached.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Chen Gang <gang.chen@asianux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:05 +09:00
Xishi Qiu 83285c72e0 mm: use pgdat_end_pfn() to simplify the code in others
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages".  Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:03 +09:00
Linus Torvalds 10d0c9705e DeviceTree updates for 3.13. This is a bit larger pull request than
usual for this cycle with lots of clean-up.
 
 - Cross arch clean-up and consolidation of early DT scanning code.
 - Clean-up and removal of arch prom.h headers. Makes arch specific
   prom.h optional on all but Sparc.
 - Addition of interrupts-extended property for devices connected to
   multiple interrupt controllers.
 - Refactoring of DT interrupt parsing code in preparation for deferred
   probe of interrupts.
 - ARM cpu and cpu topology bindings documentation.
 - Various DT vendor binding documentation updates.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJSgPQ4AAoJEMhvYp4jgsXif28H/1WkrXq5+lCFQZF8nbYdE2h0
 R8PsfiJJmAl6/wFgQTsRel+ScMk2hiP08uTyqf2RLnB1v87gCF7MKVaLOdONfUDi
 huXbcQGWCmZv0tbBIklxJe3+X3FIJch4gnyUvPudD1m8a0R0LxWXH/NhdTSFyB20
 PNjhN/IzoN40X1PSAhfB5ndWnoxXBoehV/IVHVDU42vkPVbVTyGAw5qJzHW8CLyN
 2oGTOalOO4ffQ7dIkBEQfj0mrgGcODToPdDvUQyyGZjYK2FY2sGrjyquir6SDcNa
 Q4gwatHTu0ygXpyphjtQf5tc3ZCejJ/F0s3olOAS1ahKGfe01fehtwPRROQnCK8=
 =GCbY
 -----END PGP SIGNATURE-----

Merge tag 'devicetree-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree updates from Rob Herring:
 "DeviceTree updates for 3.13.  This is a bit larger pull request than
  usual for this cycle with lots of clean-up.

   - Cross arch clean-up and consolidation of early DT scanning code.
   - Clean-up and removal of arch prom.h headers.  Makes arch specific
     prom.h optional on all but Sparc.
   - Addition of interrupts-extended property for devices connected to
     multiple interrupt controllers.
   - Refactoring of DT interrupt parsing code in preparation for
     deferred probe of interrupts.
   - ARM cpu and cpu topology bindings documentation.
   - Various DT vendor binding documentation updates"

* tag 'devicetree-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (82 commits)
  powerpc: add missing explicit OF includes for ppc
  dt/irq: add empty of_irq_count for !OF_IRQ
  dt: disable self-tests for !OF_IRQ
  of: irq: Fix interrupt-map entry matching
  MIPS: Netlogic: replace early_init_devtree() call
  of: Add Panasonic Corporation vendor prefix
  of: Add Chunghwa Picture Tubes Ltd. vendor prefix
  of: Add AU Optronics Corporation vendor prefix
  of/irq: Fix potential buffer overflow
  of/irq: Fix bug in interrupt parsing refactor.
  of: set dma_mask to point to coherent_dma_mask
  of: add vendor prefix for PHYTEC Messtechnik GmbH
  DT: sort vendor-prefixes.txt
  of: Add vendor prefix for Cadence
  of: Add empty for_each_available_child_of_node() macro definition
  arm/versatile: Fix versatile irq specifications.
  of/irq: create interrupts-extended property
  microblaze/pci: Drop PowerPC-ism from irq parsing
  of/irq: Create of_irq_parse_and_map_pci() to consolidate arch code.
  of/irq: Use irq_of_parse_and_map()
  ...
2013-11-12 16:52:17 +09:00
Eric Paris 81407c84ac audit: allow unsetting the loginuid (with priv)
If a task has CAP_AUDIT_CONTROL allow that task to unset their loginuid.
This would allow a child of that task to set their loginuid without
CAP_AUDIT_CONTROL.  Thus when launching a new login daemon, a
priviledged helper would be able to unset the loginuid and then the
daemon, which may be malicious user facing, do not need priv to function
correctly.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-11-05 11:08:09 -05:00
Ingo Molnar fb10d5b7ef Merge branch 'linus' into sched/core
Resolve cherry-picking conflicts:

Conflicts:
	mm/huge_memory.c
	mm/memory.c
	mm/mprotect.c

See this upstream merge commit for more details:

  52469b4fcd Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-01 08:24:41 +01:00
Al Viro 87dc800be2 new helper: kfree_put_link()
duplicated to hell and back...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-24 23:34:49 -04:00
HATAYAMA Daisuke fad1a86e25 procfs: call default get_unmapped_area on MMU-present architectures
Commit c4fe244857 ("sparc: fix PCI device proc file mmap(2)") added
proc_reg_get_unmapped_area in proc_reg_file_ops and
proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
get_unmapped_area method is not defined for the target procfs file,
which causes regression of mmap on /proc/vmcore.

To address this issue, like get_unmapped_area(), call default
current->mm->get_unmapped_area on MMU-present architectures if
pde->proc_fops->get_unmapped_area, i.e.  the one in actual file
operation in the procfs file, is not defined.

Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16 21:35:53 -07:00
HATAYAMA Daisuke 2cbe3b0af8 procfs: fix unintended truncation of returned mapped address
Currently, proc_reg_get_unmapped_area truncates upper 32-bit of the
mapped virtual address returned from get_unmapped_area method in
pde->proc_fops due to the variable rv of signed integer on x86_64.  This
is too small to have vitual address of unsigned long on x86_64 since on
x86_64, signed integer is of 4 bytes while unsigned long is of 8 bytes.
To fix this issue, use unsigned long instead.

Fixes a regression added in commit c4fe244857 ("sparc: fix PCI device
proc file mmap(2)").

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16 21:35:53 -07:00
Cyrill Gorcunov e9cdd6e771 mm: /proc/pid/pagemap: inspect _PAGE_SOFT_DIRTY only on present pages
If a page we are inspecting is in swap we may occasionally report it as
having soft dirty bit (even if it is clean).  The pte_soft_dirty helper
should be called on present pte only.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16 21:35:52 -07:00
Rob Herring 32df8dca50 of: remove HAVE_ARCH_DEVTREE_FIXUPS
HAVE_ARCH_DEVTREE_FIXUPS appears to always be needed except for sparc,
but it is only used for /proc/device-teee and sparc does not enable
/proc/device-tree. So this option is redundant. Remove the option and
always enable it. This has the side effect of fixing /proc/device-tree
on arches such as arm64 which failed to define this option.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
2013-10-09 20:04:08 -05:00
Mel Gorman e29cf08b05 sched/numa: Report a NUMA task group ID
It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:49 +02:00
Kirill A. Shutemov 3cd14fcd3f thp: account anon transparent huge pages into NR_ANON_PAGES
We use NR_ANON_PAGES as base for reporting AnonPages to user.  There's
not much sense in not accounting transparent huge pages there, but add
them on printing to user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Ning Qu <quning@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Michael Holzheu 11e376a3f9 vmcore: enable /proc/vmcore mmap for s390
The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" allows
now to use mmap also on s390.

So enable mmap for s390 again.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:14 -07:00
Michael Holzheu 9cb218131d vmcore: introduce remap_oldmem_pfn_range()
For zfcpdump we can't map the HSA storage because it is only available via
a read interface.  Therefore, for the new vmcore mmap feature we have
introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped.  For zfcpdump this is everything
besides of the HSA memory.  For the areas that are not mapped by
remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault
handler mmap_vmcore_fault() is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
  Return that page
* If no:
  Fill page using __vmcore_read(), set PageUptodate, and return page

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:10 -07:00
Michael Holzheu be8a8d069e vmcore: introduce ELF header in new memory feature
For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
(zfcpdump).  We have support where the first HSA_SIZE bytes are saved into
a hypervisor owned memory area (HSA) before the kdump kernel is booted.
When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.

The advantages of this mechanism are:

 * No crashkernel memory has to be defined in the old kernel.
 * Early boot problems (before kexec_load has been done) can be dumped
 * Non-Linux systems can be dumped.

We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.

Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.

So with the following patch set we would like to introduce the new
function that the ELF header for /proc/vmcore can be created in the 2nd
kernel memory.

The following steps are done during zfcpdump execution:

1.  Production system crashes
2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4.  Boot loader loads kernel into low memory area
5.  Kernel boots and uses only HSA_SIZE bytes of memory
6.  Kernel saves registers of non-boot CPUs
7.  Kernel does memory detection for dump memory map
8.  Kernel creates ELF header for /proc/vmcore
9.  /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
    - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
    - copy_oldmem_page() copies from real memory for memory above HSA_SIZE

Currently for s390 we create the ELF core header in the 2nd kernel with a
small trick.  We relocate the addresses in the ELF header in a way that
for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
and the read_from_oldmem() returns the correct data.  This allows the
/proc/vmcore code to use the ELF header in the 2nd kernel.

This patch:

Exchange the old mechanism with the new and much cleaner function call
override feature that now offcially allows to create the ELF core header
in the 2nd kernel.

To use the new feature the following function have to be defined
by the architecture backend code to read from new memory:

 * elfcorehdr_alloc: Allocate ELF header
 * elfcorehdr_free: Free the memory of the ELF header
 * elfcorehdr_read: Read from ELF header
 * elfcorehdr_read_notes: Read from ELF notes

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:10 -07:00
Oleg Nesterov 96d0df79f2 proc: make proc_fd_permission() thread-friendly
proc_fd_permission() says "process can still access /proc/self/fd after it
has executed a setuid()", but the "task_pid() = proc_pid() check only
helps if the task is group leader, /proc/self points to
/proc/<leader-pid>.

Change this check to use task_tgid() so that the whole thread group can
access its /proc/self/fd or /proc/<tid-of-sub-thread>/fd.

Notes:
	- CLONE_THREAD does not require CLONE_FILES so task->files
	  can differ, but I don't think this can lead to any security
	  problem. And this matches same_thread_group() in
	  __ptrace_may_access().

	- /proc/self should probably point to /proc/<thread-tid>, but
	  it is too late to change the rules. Perhaps it makes sense
	  to add /proc/thread though.

Test-case:

	void *tfunc(void *arg)
	{
		assert(opendir("/proc/self/fd"));
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		pthread_create(&t, NULL, tfunc, NULL);
		pthread_join(t, NULL);
		return 0;
	}

fails if, say, this executable is not readable and suid_dumpable = 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:03 -07:00
Chen Gang a3c039929d fs/proc/task_mmu.c: check the return value of mpol_to_str()
mpol_to_str() may fail, and not fill the buffer (e.g. -EINVAL), so need
check about it, or buffer may not be zero based, and next seq_printf()
will cause issue.

The failure return need after mpol_cond_put() to match get_vma_policy().

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:03 -07:00
Cyrill Gorcunov d9104d1ca9 mm: track vma changes with VM_SOFTDIRTY bit
Pavel reported that in case if vma area get unmapped and then mapped (or
expanded) in-place, the soft dirty tracker won't be able to recognize this
situation since it works on pte level and ptes are get zapped on unmap,
loosing soft dirty bit of course.

So to resolve this situation we need to track actions on vma level, there
VM_SOFTDIRTY flag comes in.  When new vma area created (or old expanded)
we set this bit, and keep it here until application calls for clearing
soft dirty bit.

Thus when user space application track memory changes now it can detect if
vma area is renewed.

Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:56 -07:00
Linus Torvalds c7c4591db6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace changes from Eric Biederman:
 "This is an assorted mishmash of small cleanups, enhancements and bug
  fixes.

  The major theme is user namespace mount restrictions.  nsown_capable
  is killed as it encourages not thinking about details that need to be
  considered.  A very hard to hit pid namespace exiting bug was finally
  tracked and fixed.  A couple of cleanups to the basic namespace
  infrastructure.

  Finally there is an enhancement that makes per user namespace
  capabilities usable as capabilities, and an enhancement that allows
  the per userns root to nice other processes in the user namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns:  Kill nsown_capable it makes the wrong thing easy
  capabilities: allow nice if we are privileged
  pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
  userns: Allow PR_CAPBSET_DROP in a user namespace.
  namespaces: Simplify copy_namespaces so it is clear what is going on.
  pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
  sysfs: Restrict mounting sysfs
  userns: Better restrictions on when proc and sysfs can be mounted
  vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
  kernel/nsproxy.c: Improving a snippet of code.
  proc: Restrict mounting the proc filesystem
  vfs: Lock in place mounts from more privileged users
2013-09-07 14:35:32 -07:00
Linus Torvalds b14662cae0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc changes from David Miller:
 "Several bug fixes (from Kirill Tkhai, Geery Uytterhoeven, and Alexey
  Dobriyan) and some support for Fujitsu sparc64x chips (from Allen
  Pais)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Export flush_ptrace_access() (needed by lustre)
  sparc: fix PCI device proc file mmap(2)
  sparc64: Remove RWSEM export leftovers
  sparc64: Fix off by one in trampoline TLB mapping installation loop.
  sparc64: Fix ITLB handler of null page
  esp_scsi: Fix tag state corruption when autosensing.
  sparc64: Fix not SRA'ed %o5 in 32-bit traced syscall
  sparc64: cleanup: Rename ret_from_syscall to ret_from_fork
  sparc32: Fix exit flag passed from traced sys_sigreturn
  sparc64: Fix wrong syscall return value passed to trace_sys_exit()
  support sparc64x chip type in cpumap.c
  cpu hw caps support for sparc64x
2013-09-05 15:28:17 -07:00
Alexey Dobriyan c4fe244857 sparc: fix PCI device proc file mmap(2)
Commit 786d7e1612 "Fix rmmod/read/write races in /proc entries"
must have broken mmapping of PCI device proc files on Sparc.

Notice how it adds wrapper around ->mmap but doesn't do it around ->get_unmapped_area.
Add wrapper around ->get_unmapped_area.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05 12:12:51 -07:00
Eric W. Biederman e51db73532 userns: Better restrictions on when proc and sysfs can be mounted
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.

Verify that the mounted filesystem is not covered in any significant
way.  I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.

Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs.  This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-26 19:17:03 -07:00
Eric W. Biederman aee1c13dd0 proc: Restrict mounting the proc filesystem
Don't allow mounting the proc filesystem unless the caller has
CAP_SYS_ADMIN rights over the pid namespace.  The principle here is if
you create or have capabilities over it you can mount it, otherwise
you get to live with what other people have mounted.

Andy pointed out that this is needed to prevent users in a user
namespace from remounting proc and specifying different hidepid and gid
options on already existing proc mounts.

Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-26 11:36:58 -07:00
Linus Torvalds 4d4323ea2d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes from the last week or so"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: collect_mounts() should return an ERR_PTR
  bfs: iget_locked() doesn't return an ERR_PTR
  efs: iget_locked() doesn't return an ERR_PTR()
  proc: kill the extra proc_readfd_common()->dir_emit_dots()
  cope with potentially long ->d_dname() output for shmem/hugetlb
2013-08-25 12:25:38 -07:00
Oleg Nesterov a5a1955e0c proc: kill the extra proc_readfd_common()->dir_emit_dots()
proc_readfd_common() does dir_emit_dots() twice in a row,
we need to do this only once.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24 12:10:22 -04:00
Linus Torvalds fd3930f70c proc: more readdir conversion bug-fixes
In the previous commit, Richard Genoud fixed proc_root_readdir(), which
had lost the check for whether all of the non-process /proc entries had
been returned or not.

But that in turn exposed _another_ bug, namely that the original readdir
conversion patch had yet another problem: it had lost the return value
of proc_readdir_de(), so now checking whether it had completed
successfully or not didn't actually work right anyway.

This reinstates the non-zero return for the "end of base entries" that
had also gotten lost in commit f0c3b5093a ("[readdir] convert
procfs").  So now you get all the base entries *and* you get all the
process entries, regardless of getdents buffer size.

(Side note: the Linux "getdents" manual page actually has a nice example
application for testing getdents, which can be easily modified to use
different buffers.  Who knew? Man-pages can be useful)

Reported-by: Emmanuel Benisty <benisty.e@gmail.com>
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Cc: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19 16:26:12 -07:00
Richard Genoud 94fc5d9de5 proc: return on proc_readdir error
Commit f0c3b5093a ("[readdir] convert procfs") introduced a bug on the
listing of the proc file-system.  The return value of proc_readdir()
isn't tested anymore in the proc_root_readdir function.

This lead to an "interesting" behaviour when we are using the getdents()
system call with a buffer too small: instead of failing, it returns the
first entries of /proc (enough to fill the given buffer), plus the PID
directories.

This is not triggered on glibc (as getdents is called with a 32KB
buffer), but on uclibc, the buffer size is only 1KB, thus some proc
entries are missing.

See https://lkml.org/lkml/2013/8/12/288 for more background.

Signed-off-by: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19 09:47:27 -07:00
yonghua zheng 8c8296223f fs/proc/task_mmu.c: fix buffer overflow in add_page_map()
Recently we met quite a lot of random kernel panic issues after enabling
CONFIG_PROC_PAGE_MONITOR.  After debuggind we found this has something
to do with following bug in pagemap:

In struct pagemapread:

  struct pagemapread {
      int pos, len;
      pagemap_entry_t *buffer;
      bool v2;
  };

pos is number of PM_ENTRY_BYTES in buffer, but len is the size of
buffer, it is a mistake to compare pos and len in add_page_map() for
checking buffer is full or not, and this can lead to buffer overflow and
random kernel panic issue.

Correct len to be total number of PM_ENTRY_BYTES in buffer.

[akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:50 -07:00
Cyrill Gorcunov 41bb3476b3 mm: save soft-dirty bits on file pages
Andy reported that if file page get reclaimed we lose the soft-dirty bit
if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get
encoded into pte entry.  Thus when #pf happens on such non-present pte
we can restore it back.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:48 -07:00
Cyrill Gorcunov 179ef71cbc mm: save soft-dirty bits on swapped pages
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set
get swapped out, the bit is getting lost and no longer available when
pte read back.

To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in
pte entry for the page being swapped out.  When such page is to be read
back from a swap cache we check for bit presence and if it's there we
clear it and restore the former _PAGE_SOFT_DIRTY bit back.

One of the problem was to find a place in pte entry where we can save
the _PTE_SWP_SOFT_DIRTY bit while page is in swap.  The _PAGE_PSE was
chosen for that, it doesn't intersect with swap entry format stored in
pte.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:47 -07:00
Michael Holzheu 5a74953ff5 s390/kdump: Disable mmap for s390
The kdump mmap patch series (git commit 83086978c6) directly
map the PT_LOADs to memory. On s390 this does not work because the
copy_from_oldmem() function swaps [0,crashkernel size] with
[crashkernel base, crashkernel base+crashkernel size]. The swap
int copy_from_oldmem() was done in order correctly implement /dev/oldmem.

See: http://marc.info/?l=kexec&m=136940802511603&w=2

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-07-18 13:40:18 +02:00
Zhao Hongjiang 30bc30df10 fs/proc/kcore.c: using strlcpy() instead of strncpy()
For NUL terminated string, set '\0' at the end.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:02 -07:00
Oleg Nesterov 1d98a5fa11 fs/proc/uptime.c:uptime_proc_show(): use get_monotonic_boottime()
Change uptime_proc_show() to use get_monotonic_boottime() instead of
do_posix_clock_monotonic_gettime() + monotonic_to_bootbased().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:02 -07:00
HATAYAMA Daisuke 83086978c6 vmcore: support mmap() on /proc/vmcore
This patch introduces mmap_vmcore().

Don't permit writable nor executable mapping even with mprotect()
because this mmap() is aimed at reading crash dump memory.  Non-writable
mapping is also requirement of remap_pfn_range() when mapping linear
pages on non-consecutive physical pages; see is_cow_mapping().

Set VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
remap_vmalloc_range_pertial at the same time for a single vma.
do_munmap() can correctly clean partially remapped vma with two
functions in abnormal case.  See zap_pte_range(), vm_normal_page() and
their comments for details.

On x86-32 PAE kernels, mmap() supports at most 16TB memory only.  This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.

[akpm@linux-foundation.org: use min(), switch to conventional error-unwinding approach]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Tested-by: Maxim Uvarov <muvarov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke 591ff71664 vmcore: calculate vmcore file size from buffer size and total size of vmcore objects
The previous patches newly added holes before each chunk of memory and
the holes need to be count in vmcore file size.  There are two ways to
count file size in such a way:

1) suppose m is a poitner to the last vmcore object in vmcore_list.
   Then file size is (m->offset + m->size), or

2) calculate sum of size of buffers for ELF header, program headers,
   ELF note segments and objects in vmcore_list.

Although 1) is more direct and simpler than 2), 2) seems better in that
it reflects internal object structure of /proc/vmcore.  Thus, this patch
changes get_vmcore_size_elf{64, 32} so that it calculates size in the
way of 2).

As a result, both get_vmcore_size_elf{64, 32} have the same definition.
Merge them as get_vmcore_size.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke ef9e78fd27 vmcore: allow user process to remap ELF note segment buffer
Now ELF note segment has been copied in the buffer on vmalloc memory.
To allow user process to remap the ELF note segment buffer with
remap_vmalloc_page, the corresponding VM area object has to have
VM_USERMAP flag set.

[akpm@linux-foundation.org: use the conventional comment layout]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke 087350c9dc vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory
The reasons why we don't allocate ELF note segment in the 1st kernel
(old memory) on page boundary is to keep backward compatibility for old
kernels, and that if doing so, we waste not a little memory due to
round-up operation to fit the memory to page boundary since most of the
buffers are in per-cpu area.

ELF notes are per-cpu, so total size of ELF note segments depends on
number of CPUs.  The current maximum number of CPUs on x86_64 is 5192,
and there's already system with 4192 CPUs in SGI, where total size
amounts to 1MB.  This can be larger in the near future or possibly even
now on another architecture that has larger size of note per a single
cpu.  Thus, to avoid the case where memory allocation for large block
fails, we allocate vmcore objects on vmalloc memory.

This patch adds elfnotes_buf and elfnotes_sz variables to keep pointer
to the ELF note segment buffer and its size.  There's no longer the
vmcore object that corresponds to the ELF note segment in vmcore_list.
Accordingly, read_vmcore() has new case for ELF note segment and
set_vmcore_list_offsets_elf{64,32}() and other helper functions starts
calculating offset from sum of size of ELF headers and size of ELF note
segment.

[akpm@linux-foundation.org: use min(), fix error-path vzalloc() leaks]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke 7f614cd1e0 vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
Treat memory chunks referenced by PT_LOAD program header entries in
page-size boundary in vmcore_list.  Formally, for each range [start,
end], we set up the corresponding vmcore object in vmcore_list to
[rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].

This change affects layout of /proc/vmcore.  The gaps generated by the
rearrangement are newly made visible to applications as holes.
Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
[end, roundup(end, PAGE_SIZE)].

Suppose variable m points at a vmcore object in vmcore_list, and
variable phdr points at the program header of PT_LOAD type the variable
m corresponds to.  Then, pictorially:

  m->offset                    +---------------+
                               | hole          |
phdr->p_offset =               +---------------+
  m->offset + (paddr - start)  |               |\
                               | kernel memory | phdr->p_memsz
                               |               |/
                               +---------------+
                               | hole          |
  m->offset + m->size          +---------------+

where m->offset and m->offset + m->size are always page-size aligned.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke f2bdacdd59 vmcore: allocate buffer for ELF headers on page-size alignment
Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().

Later patch will merge PT_NOTE entries into a single unique one and
decrease the buffer size actually used.  Keep original buffer size in
variable elfcorebuf_sz_orig to kfree the buffer later and actually used
buffer size with rounded up to page-size boundary in variable
elfcorebuf_sz separately.

The size of part of the ELF buffer exported from /proc/vmcore is
elfcorebuf_sz.

The merged, removed PT_NOTE entries, i.e.  the range [elfcorebuf_sz,
elfcorebuf_sz_orig], is filled with 0.

Use size of the ELF headers as an initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} in order to indicate that the
offset includes the holes towards the page boundary.

As a result, both set_vmcore_list_offsets_elf{64,32} have the same
definition.  Merge them as set_vmcore_list_offsets.

[akpm@linux-foundation.org: add free_elfcorebuf(), cleanups]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
HATAYAMA Daisuke b27eb18660 vmcore: clean up read_vmcore()
Rewrite part of read_vmcore() that reads objects in vmcore_list in the
same way as part reading ELF headers, by which some duplicated and
redundant codes are removed.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:30 -07:00
Pavel Emelyanov 541c237c09 pagemap: prepare to reuse constant bits with page-shift
In order to reuse bits from pagemap entries gracefully, we leave the
entries as is but on pagemap open emit a warning in dmesg, that bits
55-60 are about to change in a couple of releases.  Next, if a user
issues soft-dirty clear command via the clear_refs file (it was disabled
before v3.9) we assume that he's aware of the new pagemap format, note
that fact and report the bits in pagemap in the new manner.

The "migration strategy" looks like this then:

1. existing users are not affected -- they don't touch soft-dirty feature, thus
   see old bits in pagemap, but are warned and have time to fix themselves
2. those who use soft-dirty know about new pagemap format
3. some time soon we get rid of any signs of page-shift in pagemap as well as
   this trick with clear-soft-dirty affecting pagemap format.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:26 -07:00
Pavel Emelyanov 0f8975ec4d mm: soft-dirty bits for user memory changes tracking
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to.  In order to do this tracking one should

  1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
  2. Wait some time.
  3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is.  Thus, after this, when the task tries to modify a
page at some virtual address the #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.

Note, that although all the task's address space is marked as r/o after
the soft-dirty bits clear, the #PF-s that occur after that are processed
fast.  This is so, since the pages are still mapped to physical memory,
and thus all the kernel does is finds this fact out and puts back
writable, dirty and soft-dirty bits on the PTE.

Another thing to note, is that when mremap moves PTEs they are marked
with soft-dirty as well, since from the user perspective mremap modifies
the virtual memory at mremap's new address.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:26 -07:00
Pavel Emelyanov 2b0a9f0175 pagemap: introduce pagemap_entry_t without pmshift bits
These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry.  Moreover, in next patch we will need to report one more bit
in the pagemap, but all bits are already busy on it.

That said, describe the pagemap entry that has 6 more free zero bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Pavel Emelyanov af9de7eb18 clear_refs: introduce private struct for mm_walk
In the next patch the clear-refs-type will be required in
clear_refs_pte_range funciton, so prepare the walk->private to carry
this info.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Pavel Emelyanov 040fa02077 clear_refs: sanitize accepted commands declaration
This is the implementation of the soft-dirty bit concept that should
help keep track of changes in user memory, which in turn is very-very
required by the checkpoint-restore project (http://criu.org).

To create a dump of an application(s) we save all the information about
it to files, and the biggest part of such dump is the contents of tasks'
memory.  However, there are usage scenarios where it's not required to
get _all_ the task memory while creating a dump.  For example, when
doing periodical dumps, it's only required to take full memory dump only
at the first step and then take incremental changes of memory.  Another
example is live migration.  We copy all the memory to the destination
node without stopping all tasks, then stop them, check for what pages
has changed, dump it and the rest of the state, then copy it to the
destination node.  This decreases freeze time significantly.

That said, some help from kernel to watch how processes modify the
contents of their memory is required.

The proposal is to track changes with the help of new soft-dirty bit
this way:

1. First do "echo 4 > /proc/$pid/clear_refs".
   At that point kernel clears the soft dirty _and_ the writable bits from all
   ptes of process $pid. From now on every write to any page will result in #pf
   and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
   the soft dirty flag.

2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
   (the 55'th one). If set, the respective pte was written to since last call
   to clear refs.

The soft-dirty bit is the _PAGE_BIT_HIDDEN one.  Although it's used by
kmemcheck, the latter one marks kernel pages with it, while the former
bit is put on user pages so they do not conflict to each other.

This patch:

A new clear-refs type will be added in the next patch, so prepare
code for that.

[akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Linus Torvalds da53be12bb Don't pass inode to ->d_hash() and ->d_compare()
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:36 +04:00
Al Viro 1df98b8bbc proc_fill_cache(): clean up, get rid of pointless find_inode_number() use
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:19 +04:00
Al Viro c52a47ace7 proc_fill_cache(): just make instantiate_t return int
all instances always return ERR_PTR(-E...) or NULL, anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:18 +04:00
Al Viro db96316487 proc_pid_readdir(): stop wanking with proc_fill_cache() for /proc/self
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:17 +04:00
Al Viro 147ce69974 proc_fill_cache(): kill pointless check
we'd just checked that child->d_inode is non-NULL, for fuck sake!

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:16 +04:00
Al Viro f0c3b5093a [readdir] convert procfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:32 +04:00
Kees Cook 637241a900 kmsg: honor dmesg_restrict sysctl on /dev/kmsg
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections.  Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions.  With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.

To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:

 - /proc/kmsg allows:
  - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
    single-reader interface (SYSLOG_ACTION_READ).
  - everything, after an open.

 - syslog syscall allows:
  - anything, if CAP_SYSLOG.
  - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
    dmesg_restrict==0.
  - nothing else (EPERM).

The use-cases were:
 - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
 - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
   destructive SYSLOG_ACTION_READs.

AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.

Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.

To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.

 - /dev/kmsg allows:
  - open if CAP_SYSLOG or dmesg_restrict==0
  - reading/polling, after open

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192

[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Pavel Tikhomirov 15ef0298de posix-timers: Show clock ID in proc file
Expand information about posix-timers in /proc/<pid>/timers by adding
info about clock, with which the timer was created. I.e. in the forth
line of timer info after "notify:" line go "ClockID: <clock_id>".

Signed-off-by: Pavel Tikhomirov <snorcht@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: http://lkml.kernel.org/r/1368742323-46949-2-git-send-email-snorcht@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-28 11:41:14 +02:00
Linus Torvalds 0f47c9423c Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
Pull slab changes from Pekka Enberg:
 "The bulk of the changes are more slab unification from Christoph.

  There's also few fixes from Aaron, Glauber, and Joonsoo thrown into
  the mix."

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
  mm, slab_common: Fix bootstrap creation of kmalloc caches
  slab: Return NULL for oversized allocations
  mm: slab: Verify the nodeid passed to ____cache_alloc_node
  slub: tid must be retrieved from the percpu area of the current processor
  slub: Do not dereference NULL pointer in node_match
  slub: add 'likely' macro to inc_slabs_node()
  slub: correct to calculate num of acquired objects in get_partial_node()
  slub: correctly bootstrap boot caches
  mm/sl[au]b: correct allocation type check in kmalloc_slab()
  slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
  slab: Handle ARCH_DMA_MINALIGN correctly
  slab: Common definition for kmem_cache_node
  slab: Rename list3/l3 to node
  slab: Common Kmalloc cache determination
  stat: Use size_t for sizes instead of unsigned
  slab: Common function to create the kmalloc array
  slab: Common definition for the array of kmalloc caches
  slab: Common constants for kmalloc boundaries
  slab: Rename nodelists to node
  slab: Common name for the per node structures
  ...
2013-05-07 08:42:20 -07:00
Pekka Enberg 69df2ac128 Merge branch 'slab/next' into slab/for-linus 2013-05-07 09:19:47 +03:00
Syam Sidhardhan 75fc0cf6af proc_devtree: Replace include linux/module.h with linux/export.h
Since it uses only THIS_MODULE macro, include <linux/export.h>
is the right to go here.

Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 15:31:01 -04:00
Linus Torvalds 20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
David Howells 59d8053f1e proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
Move non-public declarations and definitions from linux/proc_fs.h to
fs/proc/internal.h.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:47 -04:00
David Howells c30480b92c proc: Make the PROC_I() and PDE() macros internal to procfs
Make the PROC_I() and PDE() macros internal to procfs.  This means making
PDE_DATA() out of line.  This could be made more optimal by storing
PDE()->data into inode->i_private.

Also provide a __PDE_DATA() that is inline and internal to procfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:47 -04:00
David Howells a8ca16ea7b proc: Supply a function to remove a proc entry by PDE
Supply a function (proc_remove()) to remove a proc entry (and any subtree
rooted there) by proc_dir_entry pointer rather than by name and (optionally)
root dir entry pointer.  This allows us to eliminate all remaining pde->name
accesses outside of procfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Grant Likely <grant.likely@linaro.or>
cc: linux-acpi@vger.kernel.org
cc: openipmi-developer@lists.sourceforge.net
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-pci@vger.kernel.org
cc: netdev@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
Al Viro 8d8b97ba49 take cgroup_open() and cpuset_open() to fs/proc/base.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
David Howells 4a520d2769 proc: Supply an accessor for getting the data from a PDE's parent
Supply an accessor function for getting the private data from the parent
proc_dir_entry struct of the proc_dir_entry struct associated with an inode.

ReiserFS, for instance, stores the super_block pointer in the proc directory
it makes for that super_block, and a pointer to the respective seq_file show
function in each of the proc files in that directory.

This allows a reduction in the number of file_operations structs, open
functions and seq_operations structs required.  The problem otherwise is that
each show function requires two pieces of data but only has storage for one
per PDE (and this has no release function).

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: Maxim Mikityanskiy <maxtram95@gmail.com>
cc: YAMANE Toshiaki <yamanetoshi@gmail.com>
cc: linux-wireless@vger.kernel.org
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:42 -04:00