Commit Graph

399394 Commits

Author SHA1 Message Date
Al Viro
8061a6fa56 9p: don't forget to destroy inode cache if fscache registration fails
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-17 22:31:01 -04:00
Al Viro
03da633aa7 atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-17 17:08:50 -04:00
Miklos Szeredi
116cc02253 vfs: don't set FILE_CREATED before calling ->atomic_open()
If O_CREAT|O_EXCL are passed to open, then we know that either

 - the file is successfully created, or
 - the operation fails in some way.

So previously we set FILE_CREATED before calling ->atomic_open() so the
filesystem doesn't have to.  This, however, led to bugs in the
implementation that went unnoticed when the filesystem didn't check for
existence, yet returned success.  To prevent this kind of bug, require
filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
verify this in the VFS.

Also added a couple more verifications for the result of atomic_open():

 - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
 - Warn if filesystem set FILE_CREATED but gave a negative dentry.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi
01c919abaf nfs: set FILE_CREATED
Set FILE_CREATED on O_CREAT|O_EXCL.  If the NFS server honored our request
for exclusivity then this must be correct.

Currently this is a no-op, since the VFS sets FILE_CREATED anyway.  The
next patch will, however, require this flag to be always set by
filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi
c5bf8fef52 gfs2: set FILE_CREATED
In gfs2_create_inode() set FILE_CREATED in *opened.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi
dfb1d61b0e cifs: fix filp leak in cifs_atomic_open()
If an error occurs after having called finish_open() then fput() needs to
be called on the already opened file.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steve French <sfrench@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi
0854d450e2 vfs: improve i_op->atomic_open() documentation
Fix documentation of ->atomic_open() and related functions: finish_open()
and finish_no_open().  Also add details that seem to be unclear and a
source of bugs (some of which are fixed in the following series).

Cc-ing maintainers of all filesystems implementing ->atomic_open().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Steve French <sfrench@samba.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Al Viro
606035e76e autofs4: close the races around autofs4_notify_daemon()
Don't drop ->wq_mutex before calling autofs4_notify_daemon() only to regain it
there.  Besides being pointless, that opens a race window where autofs4_wait_release()
could've come and freed wq->name.name.  And do the debugging printk in the "reused an
existing wq" case before dropping ->wq_mutex - the same reason...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Ian Kent <raven@themaw.net>
2013-09-16 19:16:38 -04:00
Linus Torvalds
3711d86a2d a trivial writeback fix
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMwgMAAoJECvKgwp+S8JaiJUP/RGA98MkWnl5eio9mG5eEbF/
 DC6bP5UOzPo+6oZbwH4LTc4EB04q728SSOU1nG6q1yfuSF0I1Kzt/Um6aS3P5wdk
 okyYW1SjieE0xpmfQpvMEX6TZ7L/FpYjAg47GI0TaJMUdKRmJK0fkZ22hfv6uJzr
 PMVmdJKKgxs85usrn4JyNY93xpKZgncJVuwpfFF1k9oSNIXHAk7OxT7JWj51UdqP
 k/L/HXNhT3MRVvsjyqURHMIXfqRvqcgn47LAkM/IYVdgaFkpLPvwp8RZr/CcKr7U
 KqJsQqqegRyoQ73yqgWXGAGLLXujKllsfKLu/d0vtqY2J4z6lHKTcRGpAGCDyH+3
 bLe4hk+/d+Tz0xBSPaHryy/4yiQ4O+h9rLZCwGdxMX1duoqvThL9S8fLoUkrNBai
 OU7cd4iWPlCmiquATjk0bgthCcKw3wlg+rsiSzUcaO3JbdwTp8P45Mie0ZtZ5jpa
 UcczrT6osOAAswoEPMMeySQ+BVLewSPwmYKaETniYXB5Bb/IHkliX1MkXnA1D9bI
 DNijgB2g2561BVhdkDHf2q8D4Cbrq6UhK7plATB90DB7bwNaAxmtRVJ3zDaQGKOM
 VWBbloNf5QcodshEttj9ZLko7JNF/DjNOcNomb5ZtzY+EGzMksUHBUMPld3yOcna
 LTNApshhbx92MemJ02FC
 =FB22
 -----END PGP SIGNATURE-----

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback fix from Wu Fengguang:
 "A trivial writeback fix"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Do not sort b_io list only because of block device inode
2013-09-13 23:06:40 -04:00
Linus Torvalds
89dc77bcda vfs: fix dentry LRU list handling and nr_dentry_unused accounting
The LRU list changes interacted badly with our nr_dentry_unused
accounting, and even worse with the new DCACHE_LRU_LIST bit logic.

This introduces helper functions to make sure everything follows the
proper dcache d_lru list rules: the dentry cache is complicated by the
fact that some of the hotpaths don't even want to look at the LRU list
at all, and the fact that we use the same list entry in the dentry for
both the LRU list and for our temporary shrinking lists when removing
things from the LRU.

The helper functions temporarily have some extra sanity checking for the
flag bits that have to match the current LRU state of the dentry.  We'll
remove that before the final 3.12 release, but considering how easy it
is to get wrong, this first cleanup version has some very particular
sanity checking.

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13 22:55:10 -04:00
Linus Torvalds
bdbdfdef57 Some more low risk cleanup patches:
Remove unnecessary pci_set_drvdata in k10temp driver from Jingoo Han
 Fix return values in several drivers from Sachin Kamat
 Remove redundant break in amc6821 driver from Sachin Kamat
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMeiEAAoJEMsfJm/On5mBi78P/jBVis7/4KDVFV1KAy1NhzAD
 EzxRN7X2sM3Iihj1wjMflcW6k9JSzIbC/Q82ouvVNzlgOZu85yoi926p0PIH7A/g
 E/X4hlictBzTsc9SjTGio9E6f59wFroBSDnquEo6JRCgdX9DqeaGzh8W/0PR5wSh
 SkZgXGoXxAK1fH26UZ1sXf2k4y4d6xNGVAv/HIb3jHu1EhDTb0cHzg+5m+qvdSuD
 5RBfVJMEiJITCHAhP9IdiWP6k8iePLnaBaR77J6vAxWrSBD6a3D4loDkd/FguqGJ
 svkyldm5CB0bhHCMEP6IwGNeyr1W/D2OcueYmMDc99HIY+zAOjhAnge1skBIr6A/
 hmNf7UT3xH0AQMp4I1ALgdu1Ie95lq3COGXwKy1ir/BVCYAeWC1MWb7qtmRcxqyt
 pwc9DHaplIRLoKM+CJEwV6o+gaP9L6+BiBn15g5Rv0RA7vvpl0d/Yrjx4ChWL06K
 paQcxvML/WUjq5uBIFzJmpfXhsUYcMC+YlPRwYhmEql0nTAFWnLejQrIMeSiH0VW
 ADWlN6DFJz75B7HnVnP9jk4H0ljMS9FJBhC2IkH1pDBrtLZs8ChCGXc4md0LwT7T
 1yB2AQGUhb/izDXlw/0I7Q7mJYgYfggaXg1LkRWrV9zjetvzwnArMQ/K4AOiV6yo
 Kxx5tUAkMaFzobqowgvQ
 =Lcfb
 -----END PGP SIGNATURE-----

Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:
 "Some more low risk cleanup patches:

   - Remove unnecessary pci_set_drvdata in k10temp driver from Jingoo Han
   - Fix return values in several drivers from Sachin Kamat
   - Remove redundant break in amc6821 driver from Sachin Kamat"

* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (k10temp) remove unnecessary pci_set_drvdata()
  hwmon: (tmp421) Fix return value
  hwmon: (amc6821) Remove redundant break
  hwmon: (amc6821) Fix return value
  hwmon: (ibmaem) Fix return value
  hwmon: (emc2103) Fix return value
2013-09-13 10:58:41 -07:00
Linus Torvalds
6700215140 Xtensa patchset for v3.12
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMiG+AAoJEI9vqH3mFV2sLDwP/04Zt2Wvurdwd3tAW2fgJ3c/
 RJ2nQwt1w/YhMVUz/52QOtFiOrYF8fldpS3+51FphRiBPZa9oWPafCGWLotMnnfB
 myVeJ9xArscVpzLCAMONaBazE39/HoHGHtsRSjn8WGymbDByIH8PDwA6zSX9zPTu
 HuwdAH+x40wEEN6zFxQJyS4tdnHszOrJfHozwYZkSuXApHUkfBxxRQ5teV5u7ozF
 PSRfuNjiKs9BfDobhU7olIGx+ccUspYY695B9i+ChTNkgVZDSz+HKymYAzCMuPUr
 z++1qJCp5jR08/48X2UMwevpZr9NuR7Xf1hFGZ/tplCx0DBaTYi4sotviKPINp8R
 GuVH7SMkVdR4SdarigfoRpKSB/RZ7PvyfAP5bFfFTc8gQR8R8VLLQDp+9D2j2aeU
 BKxUVFGgXj65hEQaiTJrXXNrSciGE7I64CBGgmmvOGjo5pD8m9hcRaD2HhHontdr
 N/aM6ryRxadssoFoeo3KXVhnm0X7AxuIjYWexnc7BR3w7lG2VA6hh0DSuI5B1h05
 E/oWIsZWLseSojCuPIpPTpgFidx5lG4KYBA/irz5wi2bsFwVkVzGTNFzKe4Vaki2
 R4FxBVan7NuxEcS2gjBhkonPKlyCiTWLFGQcrzNY75sDIASmzpBQWjVe8J12Z+T7
 V3z8DwIcJuVdZcoyKRth
 =O8Zr
 -----END PGP SIGNATURE-----

Merge tag 'xtensa-next-20130912' of git://github.com/czankel/xtensa-linux

Pull Xtensa updates from Chris Zankel.

* tag 'xtensa-next-20130912' of git://github.com/czankel/xtensa-linux:
  xtensa: Fix broken allmodconfig build
  xtensa: remove CCOUNT_PER_JIFFY
  xtensa: fix !CONFIG_XTENSA_CALIBRATE_CCOUNT build failure
  xtensa: don't use echo -e needlessly
  xtensa: new fast_alloca handler
  xtensa: keep a3 and excsave1 on entry to exception handlers
  xtensa: enable kernel preemption
  xtensa: check thread flags atomically on return from user exception
2013-09-13 10:57:48 -07:00
Linus Torvalds
9bf12df31f Merge git://git.kvack.org/~bcrl/aio-next
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb->ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx->tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -> reqs_available
  aio: fix build when migration is disabled
  ...
2013-09-13 10:55:58 -07:00
Linus Torvalds
399a946edb Merge branch 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull generic hardirq option removal from Martin Schwidefsky:
 "All architectures now use generic hardirqs, s390 has been last to
  switch.

  With that the code under !CONFIG_GENERIC_HARDIRQS and the related
  HAVE_GENERIC_HARDIRQS and GENERIC_HARDIRQS config options can be
  removed.  Yay!"

* 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  Remove GENERIC_HARDIRQ config option
2013-09-13 07:31:38 -07:00
Linus Torvalds
183c420323 Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig fix from Michal Marek:
 "This is a fix for a regression caused by my previous pull request.

  A sed command in scripts/config that used colons as separator was
  accidentally changed to use slashes, which fails when you use slashes
  in a value.  Changing it back to colons is of course not a proper fix,
  but at least it will be broken in the same way it had been for four
  years.  A proper fix is pending"

* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  scripts/config: fix variable substitution command
2013-09-13 07:30:17 -07:00
Linus Torvalds
951a730af4 blackfin updates for Linux 3.12
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMoABAAoJEJommM3PjknHeK0QAMCWvahMa3bHu1lPbvYoUKZ8
 lFYWye3RLpXQQgLcKTfVH/qoci1I4ssr5MmChZ88TmAZgojRlPk95rBu0pX08+dI
 6Ro6rfAYCd6LT06YFb1hOYzBkqmz2SCY1R+MKLPu2kzTC06lnd+iF2sEpBpyBSTq
 YyW42ZRgOcGf+/iqiBUo112ZbP2V00jQIeQNyvgwF7GKy+lx86SYxzsIukteSCgI
 W1pNNUnshXT9sH8tGLLtHEkAkYzSkL0mDLdpztkYKiVqXUaSZAz2jfz6CDqfTiNj
 i+wqTG02NN8lMSH8no8Eko9svzuGAmQVYQiSCL2y0Xesy4P2HW2B5uzVD0oKQXp3
 dKAmzUlhoSakAcq/6Rf11HYNYfeCN0T1VqnDt4U/OrIIq/WbK0qPsRtawYw0A5tZ
 4uOCZqxfcUW4y0y6TXBRToFb4Fa0vmGX3WGi6DnW6wU1PaEnW14tkqqvCOA9EUHr
 SZePAkPldRqLxCaJkhqS5eh6SlkObO81Nxtq7D0a1KkT6e2pYToq7QPKECitfs4U
 q5Q6PxKbxrLBRQRIYmU62dQYGsMinwaqjkOsyU0+e89iA4NGu+vq/SVRgYWiSxWF
 BHFW+OBZdYvhyTVYRF5ruSPdMxR6uUVSgMwYPYUUKaXgJ8qV9zaRhrFtdsYmweVc
 AVH7Iay5w9WSO7WghTTS
 =Shbz
 -----END PGP SIGNATURE-----

Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux

Pull blackfin updates from Steven Miao.

* tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux:
  blackfin: Ignore generated uImages
  blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x.
  bf609: adv7343: add S-Video and Component output support
  bf609: add adv7343 video encoder support
  clock: add stmmac clock for ethernet driver
  blackfin: scb: Add SCB1 to SCB9 config options and data.
  blackfin: scb: Add system crossbar init code.
2013-09-13 07:23:49 -07:00
Linus Torvalds
0898d2aa9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes a 7+ year race condition in the crypto API that causes
  sporadic crashes when multiple threads load the same algorithm.

  It also fixes the crct10dif algorithm again to prevent boot failures
  on systems where the initramfs tool ignores module softdeps"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: crct10dif - Add fallback for broken initrds
  crypto: api - Fix race condition in larval lookup
2013-09-13 07:11:14 -07:00
Martin Schwidefsky
0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
Clement Chauplannaz
86eb781889 scripts/config: fix variable substitution command
Commit 229455bc02b87f7128f190c4491b4ceffff38648 accidentally changed the
separator between sed `s' command and its parameters from ':' to '/'.

Revert this change.

Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Clement Chauplannaz <chauplac@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2013-09-13 13:06:59 +02:00
Mark Brown
08b67faa23 blackfin: Ignore generated uImages
We have the build infrastructure to generate uImages so we should ignore
the resulting generated files.

Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
2013-09-13 10:42:39 +08:00
Sonic Zhang
1d899fd652 blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x.
- Enable GMAC
- Set propler DMA PBL
- Disable DMA store and forward mode
- Select PTP input clock from MII
clock.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Steven Miao <realmz6@gmail.com>
2013-09-13 10:42:38 +08:00
Scott Jiang
e57860929c bf609: adv7343: add S-Video and Component output support
Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>
2013-09-13 10:42:36 +08:00
Scott Jiang
4940c53d26 bf609: add adv7343 video encoder support
Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>
2013-09-13 10:42:34 +08:00
Steven Miao
3036dccf2c clock: add stmmac clock for ethernet driver
Signed-off-by: Steven Miao <realmz6@gmail.com>
2013-09-13 10:42:32 +08:00
Sonic Zhang
206f060c21 blackfin: scb: Add SCB1 to SCB9 config options and data.
Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
2013-09-13 10:42:31 +08:00
Steven Miao
24a70cf2b2 blackfin: scb: Add system crossbar init code.
If SCB exists in select blackfin cpu, developer can change the SCB
priority in kernel configuration.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Steven Miao <realmz6@gmail.com>
2013-09-13 10:42:27 +08:00
Linus Torvalds
5a7d8a2808 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
 "This has been sitting in -next for a while with no objections and all
  MIPS defconfigs except one are building fine; that one platform got
  broken by another patch in your tree and I'm going to submit a patch
  separately.

   - a handful of fixes that didn't make 3.11
   - a few bits of Octeon 3 support with more to come for a later
     release
   - platform enhancements for Octeon, ath79, Lantiq, Netlogic and
     Ralink SOCs
   - a GPIO driver for the Octeon
   - some dusting off of the DECstation code
   - the usual dose of cleanups"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (65 commits)
  MIPS: DMA: Fix BUG due to smp_processor_id() in preemptible code
  MIPS: kexec: Fix random crashes while loading crashkernel
  MIPS: kdump: Skip walking indirection page for crashkernels
  MIPS: DECstation HRT calibration bug fixes
  MIPS: Export copy_from_user_page() (needed by lustre)
  MIPS: Add driver for the built-in PCI controller of the RT3883 SoC
  MIPS: DMA: For BMIPS5000 cores flush region just like non-coherent R10000
  MIPS: ralink: Add support for reset-controller API
  MIPS: ralink: mt7620: Add cpu-feature-override header
  MIPS: ralink: mt7620: Add spi clock definition
  MIPS: ralink: mt7620: Add wdt clock definition
  MIPS: ralink: mt7620: Improve clock frequency detection
  MIPS: ralink: mt7620: This SoC has EHCI and OHCI hosts
  MIPS: ralink: mt7620: Add verbose ram info
  MIPS: ralink: Probe clocksources from OF
  MIPS: ralink: Add support for systick timer found on newer ralink SoC
  MIPS: ralink: Add support for periodic timer irq
  MIPS: Netlogic: Built-in DTB for XLP2xx SoC boards
  MIPS: Netlogic: Add support for USB on XLP2xx
  MIPS: Netlogic: XLP2xx update for I2C controller
  ...
2013-09-12 16:14:49 -07:00
Linus Torvalds
e0ea4045bc xfs: update #2 for v3.12-rc1
Here we have defrag support for v5 superblock, a number of bugfixes and
 a cleanup or two.
 
 - defrag support for CRC filesystems
 - fix endian worning in xlog_recover_get_buf_lsn
 - fixes for sparse warnings
 - fix for assert in xfs_dir3_leaf_hdr_from_disk
 - fix for log recovery of remote symlinks
 - fix for log recovery of btree root splits
 - fixes formemory allocation failures with ACLs
 - fix for assert in xfs_buf_item_relse
 - fix for assert in xfs_inode_buf_verify
 - fix an assignment in an assert that should be a test in
   xfs_bmbt_change_owner
 - remove dead code in xlog_recover_inode_pass2
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMjQUAAoJENaLyazVq6ZOu2IP/1OHZYy+Bkmj0tO9pdsdEa4s
 w4FEBPsQePMJPjwdN693rKpW1exZue5sUmPMErH3ENzc2DPAwpUAlc9XAIohtdFx
 rTqrz2q+qTfZTq8oYBIA/RCOifJ2cHWN8tDYZPJpp5wceV7CRGYQeR1foiudE3ZH
 QDIPXioy8P9IkfGaXCtr/iWf9kycMO2lgNTNfdL6qtwX99HCqHZanTlsWx1BIYGQ
 Fa5TaOsXis6idPMCFMuEC15iEwA+YXc0HmXuHkMFLj+9mwFc4h/Aq65bwUkYZLmy
 +T1Wo/uQ/21rl6im/rWqgCh6fFS8NJQp8NIJeCIyihUEHbarfPyJIJRJjoP457YO
 cv8OkixCkt4zX6CkTxaL5ZFEBW9FYbRb13Gg96J6hb4WfdAFMtQg7FAjThSU/+Qr
 HwjaAso3GXimEaZD1C3c0TtZEQ0x9E6pENVI7/ewB1I0p92p7GJBMq4C7CTAYThV
 5zhdcOnViSrJTJvVQxm+gfOYzubkWWiVmbVku3RCO6//kvPBOvJ9juSYsl0mKeRu
 v2DZZB3AYJE/qnbYfZBlktX9obE6k+keKF6w8Eiufr2IqwJaqfaM4h9eogzAwTJA
 vyXKeLxUEmgHuqivFSZjw3sEK6sY654GCMMTP+2IpD19vlAIioYXdgp0ZbkkdiE3
 6twrzdFZAr1zy80xlM8W
 =2Uq6
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs

Pull xfs update #2 from Ben Myers:
 "Here we have defrag support for v5 superblock, a number of bugfixes
  and a cleanup or two.

   - defrag support for CRC filesystems
   - fix endian worning in xlog_recover_get_buf_lsn
   - fixes for sparse warnings
   - fix for assert in xfs_dir3_leaf_hdr_from_disk
   - fix for log recovery of remote symlinks
   - fix for log recovery of btree root splits
   - fixes formemory allocation failures with ACLs
   - fix for assert in xfs_buf_item_relse
   - fix for assert in xfs_inode_buf_verify
   - fix an assignment in an assert that should be a test in
     xfs_bmbt_change_owner
   - remove dead code in xlog_recover_inode_pass2"

* tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
  xfs: remove dead code from xlog_recover_inode_pass2
  xfs: = vs == typo in ASSERT()
  xfs: don't assert fail on bad inode numbers
  xfs: aborted buf items can be in the AIL.
  xfs: factor all the kmalloc-or-vmalloc fallback allocations
  xfs: fix memory allocation failures with ACLs
  xfs: ensure we copy buffer type in da btree root splits
  xfs: set remote symlink buffer type for recovery
  xfs: recovery of swap extents operations for CRC filesystems
  xfs: swap extents operations for CRC filesystems
  xfs: check magic numbers in dir3 leaf verifier first
  xfs: fix some minor sparse warnings
  xfs: fix endian warning in xlog_recover_get_buf_lsn()
2013-09-12 16:13:41 -07:00
Linus Torvalds
48efe453e6 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
 "Lots of activity again this round for I/O performance optimizations
  (per-cpu IDA pre-allocation for vhost + iscsi/target), and the
  addition of new fabric independent features to target-core
  (COMPARE_AND_WRITE + EXTENDED_COPY).

  The main highlights include:

   - Support for iscsi-target login multiplexing across individual
     network portals
   - Generic Per-cpu IDA logic (kent + akpm + clameter)
   - Conversion of vhost to use per-cpu IDA pre-allocation for
     descriptors, SGLs and userspace page pointer list
   - Conversion of iscsi-target + iser-target to use per-cpu IDA
     pre-allocation for descriptors
   - Add support for generic COMPARE_AND_WRITE (AtomicTestandSet)
     emulation for virtual backend drivers
   - Add support for generic EXTENDED_COPY (CopyOffload) emulation for
     virtual backend drivers.
   - Add support for fast memory registration mode to iser-target (Vu)

  The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of
  particular significance, which make us the first and only open source
  target to support the full set of VAAI primitives.

  Currently Linux clients are lacking upstream support to actually
  utilize these primitives.  However, with server side support now in
  place for folks like MKP + ZAB working on the client, this logic once
  reserved for the highest end of storage arrays, can now be run in VMs
  on their laptops"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits)
  target/iscsi: Bump versions to v4.1.0
  target: Update copyright ownership/year information to 2013
  iscsi-target: Bump default TCP listen backlog to 256
  target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out
  iscsi-target; Bump default CmdSN Depth to 64
  iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set
  iscsi-target: Add thread_set->ts_activate_sem + use common deallocate
  iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE
  target: remove unused including <linux/version.h>
  iser-target: introduce fast memory registration mode (FRWR)
  iser-target: generalize rdma memory registration and cleanup
  iser-target: move rdma wr processing to a shared function
  target: Enable global EXTENDED_COPY setup/release
  target: Add Third Party Copy (3PC) bit in INQUIRY response
  target: Enable EXTENDED_COPY setup in spc_parse_cdb
  target: Add support for EXTENDED_COPY copy offload emulation
  target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check
  target: Add global device list for EXTENDED_COPY
  target: Make helpers non static for EXTENDED_COPY command setup
  target: Make spc_parse_naa_6h_vendor_specific non static
  ...
2013-09-12 16:11:45 -07:00
Linus Torvalds
ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Chen Gang
de32a8177f mm/Kconfig: add MMU dependency for MIGRATION.
MIGRATION must depend on MMU, or allmodconfig for the nommu sh
architecture fails to build:

    CC      mm/migrate.o
  mm/migrate.c: In function 'remove_migration_pte':
  mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration]
     if (pmd_trans_huge(*pmd))
     ^
  mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration]
    if (!is_swap_pte(pte))
    ^
  ...

Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will
select MIGRATION by force.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Jingoo Han
6072ddc852 kernel: replace strict_strto*() with kstrto*()
The usage of strict_strto*() is not preferred, because strict_strto*() is
obsolete.  Thus, kstrto*() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
David Rientjes
17766dde36 mm, thp: count thp_fault_fallback anytime thp fault fails
Currently, thp_fault_fallback in vmstat only gets incremented if a
hugepage allocation fails.  If current's memcg hits its limit or the page
fault handler returns an error, it is incorrectly accounted as a
successful thp_fault_alloc.

Count thp_fault_fallback anytime the page fault handler falls back to
using regular pages and only count thp_fault_alloc when a hugepage has
actually been faulted.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov
c02925540c thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
to handle fallback path.

Let's consolidate code back by introducing VM_FAULT_FALLBACK return
code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov
128ec037ba thp: do_huge_pmd_anonymous_page() cleanup
Minor cleanup: unindent most code of the fucntion by inverting one
condition.  It's preparation for the next patch.

No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov
3122359a64 thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
It's confusing that mk_huge_pmd() has semantics different from mk_pte() or
mk_pmd().  I spent some time on debugging issue cased by this
inconsistency.

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust prototype
to match mk_pte().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov
66a0c8ee3d mm: cleanup add_to_page_cache_locked()
Make add_to_page_cache_locked() cleaner:

 - unindent most code of the function by inverting one condition;
 - streamline code no-error path;
 - move insert error path outside normal code path;
 - call radix_tree_preload_end() earlier;

No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov
3cd14fcd3f thp: account anon transparent huge pages into NR_ANON_PAGES
We use NR_ANON_PAGES as base for reporting AnonPages to user.  There's
not much sense in not accounting transparent huge pages there, but add
them on printing to user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Ning Qu <quning@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov
7caef26767 truncate: drop 'oldsize' truncate_pagecache() parameter
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Chris Metcalf
5fbc461636 mm: make lru_add_drain_all() selective
make lru_add_drain_all() only selectively interrupt the cpus that have
per-cpu free pages that can be drained.

This is important in nohz mode where calling mlockall(), for example,
otherwise will interrupt every core unnecessarily.

This is important on workloads where nohz cores are handling 10 Gb traffic
in userspace.  Those CPUs do not enter the kernel and place pages into LRU
pagevecs and they really, really don't want to be interrupted, or they
drop packets on the floor.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
9cb2dc1c95 memcg: document cgroup dirty/writeback memory statistics
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
3ea67d06e4 memcg: add per cgroup writeback pages accounting
Add memcg routines to count writeback pages, later dirty pages will also
be accounted.

After Kame's commit 89c06bd52f ("memcg: use new logic for page stat
accounting"), we can use 'struct page' flag to test page state instead
of per page_cgroup flag.  But memcg has a feature to move a page from a
cgroup to another one and may have race between "move" and "page stat
accounting".  So in order to avoid the race we have designed a new lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b)(writeback pages accounting) to be pretected
in mem_cgroup_{begin/end}_update_page_stat().  It's full no-op for
!CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu
read lock in the most cases (no task is moving), and spin_lock_irqsave
on top in the slow path.

There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
And the lock order is:
	--> memcg->move_lock
	  --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
658b72c5a7 memcg: check for proper lock held in mem_cgroup_update_page_stat
We should call mem_cgroup_begin_update_page_stat() before
mem_cgroup_update_page_stat() to get proper locks, however the latter
doesn't do any checking that we use proper locking, which would be hard.
Suggested by Michal Hock we could at least test for rcu_read_lock_held()
because RCU is held if !mem_cgroup_disabled().

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
68b4876d99 memcg: remove MEMCG_NR_FILE_MAPPED
While accounting memcg page stat, it's not worth to use
MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
complexity and presumed performance overhead.  We can use
MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
1a36e59d48 memcg: reduce function dereference
This function dereferences res far too often, so optimize it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
3af3351676 memcg: avoid overflow caused by PAGE_ALIGN
Since PAGE_ALIGN is aligning up(the next page boundary), so after
PAGE_ALIGN, the value might be overflow, such as write the MAX value to
*.limit_in_bytes.

  $ cat /cgroup/memory/memory.limit_in_bytes
  18446744073709551615

  # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
  bash: echo: write error: Invalid argument

Some user programs might depend on such behaviours(like libcg, we read
the value in snapshot, then use the value to reset cgroup later), and
that will cause confusion.  So we need to fix it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
6de5a8bfca memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
34ff8dc089 memcg: correct RESOURCE_MAX to ULLONG_MAX
Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource
limit is unsigned long long, so we can set bigger value than that which is
strange.  The XXX_MAX should be reasonable max value, bigger than that
should be overflow.

Notice that this change will affect user output of default *.limit_in_bytes:
before change:

  $ cat /cgroup/memory/memory.limit_in_bytes
  9223372036854775807

after change:

  $ cat /cgroup/memory/memory.limit_in_bytes
  18446744073709551615

But it doesn't alter the API in term of input - we can still use "echo -1
> *.limit_in_bytes" to reset the numbers to "unlimited".

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Johannes Weiner
3812c8c8f3 mm: memcg: do not trap chargers with full callstack on OOM
The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0              # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Johannes Weiner
fb2a6fc56b mm: memcg: rework and document OOM waiting and wakeup
The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies.  However, it would be nice if this construct would be
clearly recognizable and not be as obfuscated as it is right now.  Clean
up as follows:

1. Remove the return value of mem_cgroup_oom_unlock()

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
   makes it more obvious that the task has to be on the waitqueue
   before attempting to OOM-trylock the hierarchy, to not miss any
   wakeups before going to sleep.  It just didn't matter until now
   because it was all lumped together into the global memcg_oom_lock
   spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
   It is proctected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
   lock in any given hierarchy atomically.  Restrict its scope to
   mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
   function.  Only the lockholder has to wake up the next in line
   after releasing the lock.

   Note that the lockholder kicks off the OOM-killer, which in turn
   leads to wakeups from the uncharges of the exiting task.  But a
   contender is not guaranteed to see them if it enters the OOM path
   after the OOM kills but before the lockholder releases the lock.
   Thus there has to be an explicit wakeup after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
   under OOM as that is the point where we start to receive wakeups.
   No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
   symmetry.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:01 -07:00