Commit Graph

425938 Commits

Author SHA1 Message Date
Tejun Heo 398f878789 Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
Pull for-3.14-fixes to receive 0ab02ca8f8 ("cgroup: protect
modifications to cgroup_idr with cgroup_mutex") prior to kernfs
conversion series to avoid non-trivial conflicts.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-11 11:02:59 -05:00
Li Zefan 0ab02ca8f8 cgroup: protect modifications to cgroup_idr with cgroup_mutex
Set up cgroupfs like this:
  # mount -t cgroup -o cpuacct xxx /cgroup
  # mkdir /cgroup/sub1
  # mkdir /cgroup/sub2

Then run these two commands:
  # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /cgroup/sub1/tmp; } &
  # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /cgroup/sub2/tmp; } &

After a few seconds you may see this warning:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
idr_remove called for id=6 which is not allocated.
...
Call Trace:
 [<ffffffff8156063c>] dump_stack+0x7a/0x96
 [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
 [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81300aa7>] sub_remove+0x87/0x1b0
 [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
 [<ffffffff81300bf5>] idr_remove+0x25/0xd0
 [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
 [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
 [<ffffffff8107cdbc>] process_one_work+0x26c/0x550
 [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
 [<ffffffff81085f96>] kthread+0xe6/0xf0
 [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
---[ end trace 2d1577ec10cf80d0 ]---

This is because allocation and removal of cgroup IDs are not properly synchronized.

The bug was introduced when we converted cgroup_ida to cgroup_idr.
While ida_simple_{get,remove}() do their synchronization internally,
callers are responsible for serializing concurrent calls to
idr_{alloc,remove}() themselves.

tj: Refreshed on top of b58c89986a ("cgroup: fix error return from
cgroup_create()").

Fixes: 4e96ee8e98 ("cgroup: convert cgroup_ida to cgroup_idr")
Cc: <stable@vger.kernel.org> #3.12+
Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-11 10:38:30 -05:00
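
The locking rule behind this fix can be illustrated with a minimal, self-contained sketch (illustrative names only, not the actual cgroup code): ida_simple_{get,remove}() serialize internally, but idr_{alloc,remove}() callers must provide their own lock, here a mutex standing in for cgroup_mutex.

  #include <linux/idr.h>
  #include <linux/mutex.h>
  #include <linux/gfp.h>

  static DEFINE_IDR(example_idr);
  static DEFINE_MUTEX(example_mutex);    /* stands in for cgroup_mutex */
  static int example_dummy;

  /* idr_alloc()/idr_remove() need external serialization */
  static int example_alloc_id(void)
  {
      int id;

      mutex_lock(&example_mutex);
      id = idr_alloc(&example_idr, &example_dummy, 1, 0, GFP_KERNEL);
      mutex_unlock(&example_mutex);
      return id;
  }

  static void example_free_id(int id)
  {
      mutex_lock(&example_mutex);
      idr_remove(&example_idr, id);
      mutex_unlock(&example_mutex);
  }
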
Tejun Heo f7cef064aa Merge branch 'driver-core-next' into cgroup/for-3.15
Pending kernfs conversion depends on kernfs improvements in
driver-core-next.  Pull it into for-3.15.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-08 10:37:44 -05:00
Tejun Heo 1a698a4aba Merge branch 'for-3.14-fixes' into for-3.15
Pending kernfs conversion depends on fixes in for-3.14-fixes.  Pull it
into for-3.15.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-08 10:37:14 -05:00
Tejun Heo 3417ae1f5f cgroup: remove cgroup_root_mutex
cgroup_root_mutex was added to avoid deadlock involving namespace_sem
via cgroup_show_options().  It added a lot of overhead for the small
purpose of it and, because it's nested under cgroup_mutex, it has very
limited usefulness.  The previous patch made cgroup_show_options() not
use cgroup_root_mutex, so nobody needs it anymore.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08 10:37:01 -05:00
Tejun Heo 69e943b7d3 cgroup: update locking in cgroup_show_options()
cgroup_show_options() grabs cgroup_root_mutex to protect the options
changing while printing; however, holding root_mutex or not doesn't
really make much difference for the function.  subsys_mask can be
atomically tested and most of the options aren't allowed to change
anyway once mounted.

The only field which needs synchronization is ->release_agent_path.
This patch introduces a dedicated spinlock to synchronize accesses to
the field and drops cgroup_root_mutex locking from
cgroup_show_options().  The next patch will remove cgroup_root_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08 10:36:58 -05:00
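
A rough sketch of the locking shape this change describes (a standalone illustration; the real fields and seq_file plumbing live in kernel/cgroup.c): a small dedicated spinlock covers only the path buffer, so showing options no longer needs any cgroup-wide mutex.

  #include <linux/spinlock.h>
  #include <linux/seq_file.h>
  #include <linux/string.h>
  #include <linux/limits.h>

  static DEFINE_SPINLOCK(release_agent_path_lock);
  static char release_agent_path[PATH_MAX];

  static void example_set_release_agent(const char *path)
  {
      spin_lock(&release_agent_path_lock);
      strlcpy(release_agent_path, path, sizeof(release_agent_path));
      spin_unlock(&release_agent_path_lock);
  }

  static int example_show_options(struct seq_file *seq)
  {
      spin_lock(&release_agent_path_lock);
      if (release_agent_path[0])
          seq_printf(seq, ",release_agent=%s", release_agent_path);
      spin_unlock(&release_agent_path_lock);
      return 0;
  }
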
Tejun Heo aec25020f5 cgroup: rename cgroup_subsys->subsys_id to ->id
It's no longer referenced outside cgroup core, so renaming is easy.
Let's rename it for consistency & brevity.

This patch is a pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08 10:36:58 -05:00
Tejun Heo 073219e995 cgroup: clean up cgroup_subsys names and initialization
cgroup_subsys is a bit messier than it needs to be.

* The name of a subsys can be different from its internal identifier
  defined in cgroup_subsys.h.  Most subsystems use the matching name
  but three - cpu, memory and perf_event - use different ones.

* cgroup_subsys_id enums are postfixed with _subsys_id and each
  cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
  included throughout various subsystems; it doesn't and shouldn't
  have a claim on such generic names which don't have any qualifier
  indicating that they belong to cgroup.

* cgroup_subsys->subsys_id should always equal the matching
  cgroup_subsys_id enum; however, we require each controller to
  initialize it and then BUG if they don't match, which is a bit
  silly.

This patch cleans up cgroup_subsys names and initialization by doing
the following.

* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
  cgroup_subsys with _cgrp_subsys.

* With the above, renaming subsys identifiers to match the userland
  visible names doesn't cause any naming conflicts.  All non-matching
  identifiers are renamed to match the official names.

  cpu_cgroup -> cpu
  mem_cgroup -> memory
  perf -> perf_event

* controllers no longer need to initialize ->subsys_id and ->name.
  They're generated in cgroup core and set automatically during boot.

* Redundant cgroup_subsys declarations removed.

* While updating BUG_ON()s in cgroup_init_early(), convert them to
  WARN()s.  BUGging that early during boot is stupid - the kernel
  can't print anything, even through the serial console, and the trap
  handler doesn't even link the stack frame properly for back-tracing.

This patch doesn't introduce any behavior changes.

v2: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
    classid handling into core").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
2014-02-08 10:36:58 -05:00
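
For orientation, the renamed identifiers come out of the usual x-macro expansion of cgroup_subsys.h; roughly (a sketch of the pattern, the exact header contents may differ):

  /* every "SUBSYS(foo)" line in cgroup_subsys.h becomes foo_cgrp_id */
  #define SUBSYS(_x) _x ## _cgrp_id,
  enum cgroup_subsys_id {
  #include <linux/cgroup_subsys.h>
      CGROUP_SUBSYS_COUNT,
  };
  #undef SUBSYS

  /* ...and each controller defines e.g. "struct cgroup_subsys memory_cgrp_subsys" */
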
Tejun Heo 3ed80a62bf cgroup: drop module support
With module support dropped from net_prio, no controller is using
cgroup module support.  None of the actual resource controllers can be
built as a module and we aren't gonna add new controllers which don't
control resources.  This patch drops module support from cgroup.

* cgroup_[un]load_subsys() and cgroup_subsys->module removed.

* As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
  cgroup_subsys.h now uses IS_ENABLED() directly.

* enum cgroup_subsys_id now exactly matches the list of enabled
  controllers as ordered in cgroup_subsys.h.

* cgroup_subsys[] is now a contiguously occupied array.  Size
  specification is no longer necessary and dropped.

* for_each_builtin_subsys() is removed and for_each_subsys() is
  updated to not require any locking.

* module ref handling is removed from rebind_subsystems().

* Module related comments dropped.

v2: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
    classid handling into core").

v3: Added {} around the if (need_forkexit_callback) block in
    cgroup_post_fork() for readability as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08 10:36:58 -05:00
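
An illustrative cgroup_subsys.h entry after this change (a sketch): with modular controllers gone, a plain IS_ENABLED() test is all that's needed to decide whether a SUBSYS() line is compiled in.

  #if IS_ENABLED(CONFIG_CGROUP_CPUACCT)
  SUBSYS(cpuacct)
  #endif
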
Tejun Heo af6363374c cgroup: make CONFIG_CGROUP_NET_PRIO bool and drop unnecessary init_netclassid_cgroup()
net_prio is the only cgroup which is allowed to be built as a module.
The savings from allowing one controller to be built as a module are
tiny especially given that cgroup module support itself adds quite a
bit of complexity.

Given that none of the other controllers has much chance of being made a
module and that we're unlikely to add new modular controllers, the
added complexity is simply not justifiable.

As a first step to drop cgroup module support, this patch changes the
config option to bool from tristate and drops module related code from
it.

Also, while an earlier commit fe1217c4f3 ("net: net_cls: move
cgroupfs classid handling into core") dropped module support from
net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
noop for built-in controllers.  Drop it along with
init_netclassid_cgroup().

v2: Removed modular version of task_netprioidx() in
    include/net/netprio_cgroup.h as suggested by Li Zefan.

v3: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
    classid handling into core").  net_cls cgroup part is mostly
    dropped except for removal of init_netclassid_cgroup().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
2014-02-08 10:36:58 -05:00
Tejun Heo 48573a8933 cgroup: fix locking in cgroup_cfts_commit()
cgroup_cfts_commit() walks the cgroup hierarchy that the target
subsystem is attached to and tries to apply the file changes.  Due to
the convolution with inode locking, it can't keep cgroup_mutex locked
while iterating.  It currently holds only RCU read lock around the
actual iteration and then pins the found cgroup using dget().

Unfortunately, this is incorrect.  Although the iteration does check
cgroup_is_dead() before invoking dget(), there's nothing which
prevents the dentry from going away in between.  Note that this is
different from the usual css iterations where css_tryget() is used to
pin the css - css_tryget() tests whether the css can be pinned and
fails if not.

The problem can be solved by simply holding cgroup_mutex instead of
RCU read lock around the iteration, which actually reduces LOC.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2014-02-08 10:26:34 -05:00
Tejun Heo b58c89986a cgroup: fix error return from cgroup_create()
cgroup_create() was returning 0 after allocation failures.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2014-02-08 10:26:33 -05:00
Tejun Heo eb46bf8969 cgroup: fix error return value in cgroup_mount()
When cgroup_mount() failed to allocate an id for the root, it didn't
set ret before jumping to unlock_drop, ending up returning 0 after a
failure.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2014-02-08 10:26:33 -05:00
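
Both error-return fixes above are instances of the same bug shape; a generic, self-contained sketch (not the actual cgroup code):

  #include <linux/idr.h>
  #include <linux/gfp.h>

  static DEFINE_IDR(example_idr);
  static int example_dummy;

  static int example_setup(void)
  {
      int ret = 0;
      int id;

      id = idr_alloc(&example_idr, &example_dummy, 1, 0, GFP_KERNEL);
      if (id < 0) {
          ret = id;    /* the fix: without this, the error path returns 0 */
          goto out;
      }
      /* ... further setup using id ... */
  out:
      return ret;
  }
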
Tejun Heo ba341d55a4 kernfs: add CONFIG_KERNFS
As sysfs was kernfs's only user, kernfs has been piggybacking on
CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
soon.  Introduce a separate config option CONFIG_KERNFS which is to be
selected by kernfs users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:08:57 -08:00
Tejun Heo fa4cd451cc sysfs, kobject: add sysfs wrapper for kernfs_enable_ns()
Currently, kobject is invoking kernfs_enable_ns() directly.  This is
fine now as sysfs and kernfs are enabled and disabled together.  If
sysfs is disabled, kernfs_enable_ns() is switched to a dummy
implementation too and everything is fine; however, kernfs will soon
have its own config option CONFIG_KERNFS and !SYSFS && KERNFS will be
possible, which can make kobject call into non-dummy
kernfs_enable_ns() with NULL kernfs_node pointers leading to an oops.

Introduce sysfs_enable_ns() which is a wrapper around
kernfs_enable_ns() so that it can be made a noop depending only on
CONFIG_SYSFS regardless of the planned CONFIG_KERNFS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:08:57 -08:00
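
The wrapper's shape is roughly the following (a sketch; the real declarations are in include/linux/sysfs.h): it forwards to kernfs when sysfs is built in and compiles to a no-op otherwise, so kobject never hands a NULL kernfs_node to live kernfs code.

  #include <linux/kernfs.h>

  #ifdef CONFIG_SYSFS
  static inline void sysfs_enable_ns(struct kernfs_node *kn)
  {
      kernfs_enable_ns(kn);
  }
  #else
  static inline void sysfs_enable_ns(struct kernfs_node *kn)
  {
  }
  #endif
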
Tejun Heo 3eef34ad7d kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends
kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with an irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name()		- format the node's name into the specified buffer
* kernfs_path()		- format the node's path into the specified buffer
* pr_cont_kernfs_name()	- pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path()	- pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent()	- pin and return a node's parent

All of them can be called from any context.  The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
    static inline making it cause a lot of build warnings.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:05:35 -08:00
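
The accessor shape is roughly the following (a condensed sketch; the real helpers live in fs/kernfs/dir.c): every read of ->name happens under the irq-safe rename lock, so a concurrent kernfs_rename_ns() can never be observed half-way.

  #include <linux/kernfs.h>
  #include <linux/spinlock.h>
  #include <linux/string.h>

  static DEFINE_SPINLOCK(kernfs_rename_lock);

  /* copy the node's current name into @buf; safe against concurrent renames */
  static int example_kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
  {
      unsigned long flags;
      int ret;

      spin_lock_irqsave(&kernfs_rename_lock, flags);
      ret = strlcpy(buf, kn->parent ? kn->name : "/", buflen);
      spin_unlock_irqrestore(&kernfs_rename_lock, flags);
      return ret;
  }
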
Tejun Heo 0c23b2259a kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename()
Implement helpers to determine node from dentry and root from
super_block.  Also add a kernfs_rename_ns() wrapper which assumes NULL
namespace.  These generally make sense and will be used by cgroup.

v2: Some dummy implementations for !CONFIG_SYSFS were missing.  Fixed.
    Reported by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:00:41 -08:00
Tejun Heo 2536390da0 kernfs: add kernfs_open_file->priv
Add a private data field to be used by kernfs file operations.  This
generally makes sense and will be used by cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 16:00:40 -08:00
Tejun Heo 4d3773c4bb kernfs: implement kernfs_ops->atomic_write_len
A write to a kernfs_node is buffered through a kernel buffer.  Writes
<= PAGE_SIZE are performed atomically, while larger ones are executed
in PAGE_SIZE chunks.  While this is enough for sysfs, cgroup which is
scheduled to be converted to use kernfs needs a bit more control over
it.

This patch adds kernfs_ops->atomic_write_len.  If not set (zero), the
behavior stays the same.  If set, writes up to that size are executed
atomically and larger writes are rejected with -E2BIG.

A different implementation strategy would be to allow configuring the
chunk size while making the original write size available to the
write method; however, such a strategy, while more complicated,
doesn't really buy anything.  If the write implementation has to
handle chunking, the specific chunk size shouldn't matter all that
much.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00
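
The resulting size policy can be sketched as follows (simplified from the write path, not the literal kernfs code):

  #include <linux/kernfs.h>
  #include <linux/kernel.h>
  #include <linux/errno.h>

  /* decide how many bytes of a @count-byte write to buffer for ops->write() */
  static ssize_t example_write_len(const struct kernfs_ops *ops, size_t count)
  {
      if (ops->atomic_write_len) {
          if (count > ops->atomic_write_len)
              return -E2BIG;    /* reject instead of chunking */
          return count;         /* delivered to ->write() in one go */
      }
      return min_t(size_t, count, PAGE_SIZE);    /* old behavior: PAGE_SIZE chunks */
  }
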
Tejun Heo d35258ef70 kernfs: allow nodes to be created in the deactivated state
Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes.  In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.

This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API.  Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about locking dependency on active_ref draining.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00
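
A usage sketch of the flag described above (prototypes paraphrased from include/linux/kernfs.h; exact signatures may differ): all nodes are created invisible and then flipped on together.

  #include <linux/kernfs.h>
  #include <linux/err.h>
  #include <linux/errno.h>

  static struct kernfs_syscall_ops example_syscall_ops;

  static int example_populate(void)
  {
      struct kernfs_root *root;
      struct kernfs_node *a, *b;

      root = kernfs_create_root(&example_syscall_ops,
                                KERNFS_ROOT_CREATE_DEACTIVATED, NULL);
      if (IS_ERR(root))
          return PTR_ERR(root);

      /* both directories exist but are not yet visible to userland */
      a = kernfs_create_dir(root->kn, "a", 0755, NULL);
      b = kernfs_create_dir(root->kn, "b", 0755, NULL);
      if (IS_ERR(a) || IS_ERR(b)) {
          /* never-activated nodes skip active_ref draining on removal */
          kernfs_destroy_root(root);
          return -ENOMEM;
      }

      /* make the whole subtree visible in one go */
      kernfs_activate(root->kn);
      return 0;
  }
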
Tejun Heo b9c9dad0c4 kernfs: add missing kernfs_active() checks in directory operations
kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were
missing kernfs_active() tests before using the found kernfs_node.  As
deactivated state is currently visible only while a node is being
removed, this doesn't pose an actual problem.  e.g. lookup succeeding
on a deactivated node doesn't harm anything as the eventual file
operations are gonna fail and those failures are indistinguishable
from the cases in which the lookups had happened before the node was
deactivated.

However, we're gonna allow new nodes to be created deactivated and
then activated explicitly by the kernfs user when it sees fit.  This
is to support atomically making multiple nodes visible to userland and
thus those nodes must not be visible to userland before activated.

Let's plug the lookup and readdir holes so that deactivated nodes are
invisible to userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00
Tejun Heo 6a7fed4eef kernfs: implement kernfs_syscall_ops->remount_fs() and ->show_options()
Add two super_block related syscall callbacks ->remount_fs() and
->show_options() to kernfs_syscall_ops.  These simply forward the
matching super_operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00
Tejun Heo 90c07c895c kernfs: rename kernfs_dir_ops to kernfs_syscall_ops
We're gonna need non-dir syscall callbacks, which will make dir_ops a
misnomer.  Let's rename kernfs_dir_ops to kernfs_syscall_ops.

This is a pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00
Tejun Heo 07c7530dd4 kernfs: invoke dir_ops while holding active ref of the target node
kernfs_dir_ops are currently being invoked without any active
reference, which makes it tricky for the invoked operations to
determine whether the objects associated with those nodes are safe to
access and will remain that way for the duration of such operations.

kernfs already has the active_ref mechanism to deal with this, which
makes the removal of a given node the synchronization point for gating
the file operations.  There's no reason for dir_ops to be any
different.  Update the dir_ops handling so that active_ref is held
while the dir_ops are executing.  This guarantees that the target node
stays alive while a dir_op is executing.

As kernfs_dir_ops doesn't have any in-kernel user at this point, this
doesn't affect anybody.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:52:48 -08:00
Tejun Heo ce8b04aa6c sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
All device_schedule_callback_owner() users are converted to use
device_remove_file_self().  Remove now unused
{sysfs|device}_schedule_callback_owner().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:41 -08:00
Tejun Heo 0b60f9ead5 s390: use device_remove_file_self() instead of device_schedule_callback()
driver-core now supports synchronous self-deletion of attributes and
the asynchronous removal mechanism is scheduled for removal.  Use it
instead of device_schedule_callback().

* Conversions in arch/s390/pci/pci_sysfs.c and
  drivers/s390/block/dcssblk.c are straightforward.

* drivers/s390/cio/ccwgroup.c is a bit more tricky because
  ccwgroup_notifier() was (ab)using device_schedule_callback() purely
  to obtain a process context from a notifier callback so that it
  could kick off an ungroup operation, which may block.

  Rename ccwgroup_ungroup_callback() to ccwgroup_ungroup() and make it
  take ccwgroup_device * instead.  The new function is now called
  directly from ccwgroup_ungroup_store().

  ccwgroup_notifier() chain is updated to explicitly bounce through
  ccwgroup_device->ungroup_work.  This also removes possible failure
  from memory pressure.

Only compile-tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: linux-s390@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:41 -08:00
Tejun Heo ac0ece9174 scsi: use device_remove_file_self() instead of device_schedule_callback()
driver-core now supports synchronous self-deletion of attributes and
the asynchronous removal mechanism is scheduled for removal.  Use it
instead of device_schedule_callback().  This makes "delete" behave
synchronously.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:41 -08:00
Tejun Heo bc6caf02cc pci: use device_remove_file_self() instead of device_schedule_callback()
driver-core now supports synchronous self-deletion of attributes and
the asynchronous removal mechanism is scheduled for removal.  Use it
instead of device_schedule_callback().  This makes "remove" behave
synchronously.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:41 -08:00
Tejun Heo 6b0afc2a21 kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers
Sometimes it's necessary to implement a node which wants to delete
nodes including itself.  This isn't straightforward because of kernfs
active reference.  While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing.  For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.

This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously.  If a removal
operation is immediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.

The thing is, there's no inherent reason for this to be asynchronous.
All that's necessary to do it synchronously is a dedicated operation
which drops its own active ref and deactivates itself.  This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core.  kernfs_remove_self() is to be called from one of the file
operations, drops the active ref the task is holding, removes the self
node, and restores active ref to the dead node so that the ref is
balanced afterwards.  __kernfs_remove() is updated so that it takes an
early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.

This makes implementing self-deleting nodes very easy.  The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node.  The method can
invoke kernfs_remove_self() on itself before proceeding with the normal
removal path.  kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.

This will replace sysfs_schedule_callback().  A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once.  An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion.  All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false.  This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.

Note that manipulation of active ref is implemented in separate public
functions - kernfs_[un]break_active_protection().
kernfs_remove_self() is the only user at the moment but this will be
used to cater to more complex cases.

v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
    and sysfs_remove_file_self() had incorrect return type.  Fix it.
    Reported by kbuild test bot.

v3: kernfs_[un]break_active_protection() separated out from
    kernfs_remove_self() and exposed as public API.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:41 -08:00
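
A usage sketch of the driver-core wrapper in a self-deleting attribute (the attribute name and teardown helper are made up for illustration; device_remove_file_self() is the wrapper introduced by this series):

  #include <linux/device.h>
  #include <linux/sysfs.h>

  static void example_tear_down(struct device *dev)
  {
      /* hypothetical teardown for this sketch */
      device_unregister(dev);
  }

  static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
                              const char *buf, size_t count)
  {
      /*
       * Of several concurrent writers only one gets true and performs the
       * teardown; the rest block until it finishes and then see false, so
       * every "echo 1 > delete" returns only after the device is gone.
       */
      if (device_remove_file_self(dev, attr))
          example_tear_down(dev);
      return count;
  }
  static DEVICE_ATTR_WO(delete);
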
Tejun Heo 81c173cb5e kernfs: remove KERNFS_REMOVED
KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and so that adding new nodes under
them or renaming them is denied; however, its role overlaps that of
deactivation.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations.  There's no reason to have them separate, making
things more complex than necessary.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
  deactivated.  This means that we now use both atomic_add() and
  atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN.  The compiler
  generates an overflow warning when negating INT_MIN as the negation
  can't be represented as a positive number.  Nothing is actually
  broken but let's bump BIAS by one to avoid the warnings for archs
  which negate the subtrahend.

* A new helper kernfs_active() which tests whether kn->active >= 0 is
  added for convenience and lockdep annotation.  All KERNFS_REMOVED
  tests are replaced with negated kernfs_active() tests.

* __kernfs_remove() is updated to deactivate, but not drain, all nodes
  in the subtree instead of setting KERNFS_REMOVED.  This removes
  deactivation from kernfs_deactivate(), which is now renamed to
  kernfs_drain().

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
  checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring.  kernfs_active()
    dropped and kernfs_get/put_active() used instead.  RB_EMPTY_NODE()
    used in the lookup paths.

v3: Reverted most of v2 except for creating a new node with
    KN_DEACTIVATED_BIAS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:41 -08:00
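
The convenience helper mentioned above boils down to a sign test on the active count (a sketch; the in-tree version also asserts that kernfs_mutex is held):

  #include <linux/kernfs.h>
  #include <linux/atomic.h>

  /* a node is active unless its active count has been biased negative */
  static bool kernfs_active(struct kernfs_node *kn)
  {
      return atomic_read(&kn->active) >= 0;
  }
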
Tejun Heo 182fd64b66 kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()
There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().

While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths.  Let's drop KERNFS_ACTIVE_REF.

While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.

v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
    KERNFS_LOCKDEP flag").  As the earlier patch already added
    KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
    dropped from this patch and the existing ones are simply converted
    to kernfs_lockdep().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:40 -08:00
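
The helper is essentially the following (a sketch): a flag test that vanishes entirely when lockdep isn't configured.

  #include <linux/kernfs.h>

  static bool kernfs_lockdep(struct kernfs_node *kn)
  {
  #ifdef CONFIG_DEBUG_LOCK_ALLOC
      return kn->flags & KERNFS_LOCKDEP;
  #else
      return false;
  #endif
  }
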
Tejun Heo 988cd7afb3 kernfs: remove kernfs_addrm_cxt
kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes.  The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and removes kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release kernfs_mutex itself.
  sysfs_addrm_start/finish() invocations around it are removed from
  all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
  it to kernfs_addrm_cxt.  Its callers are updated to grab and release
  kernfs_mutex instead of calling kernfs_addrm_start/finish() around
  it.

v2: Rebased on top of "kernfs: associate a new kernfs_node with its
    parent on creation" which dropped @parent from kernfs_add_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:40 -08:00
Tejun Heo ccf02aaf81 kernfs: invoke kernfs_unmap_bin_file() directly from kernfs_deactivate()
kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
the target file before kernfs_remove() finishes; however, it currently
is being called from kernfs_addrm_finish() and has the same race
problem as the original implementation of deactivation when there are
multiple removers - only the remover which snatches the node to its
addrm_cxt->removed list is guaranteed to wait for its completion
before returning.

It can be easily fixed by moving the kernfs_unmap_bin_file()
invocation from kernfs_addrm_finish() to kernfs_deactivate().  The
function may be called multiple times but that shouldn't do any harm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:40 -08:00
Tejun Heo 35beab0635 kernfs: restructure removal path to fix possible premature return
The recursive nature of kernfs_remove() means that, even if
kernfs_remove() is not allowed to be called multiple times on the same
node, there may be race conditions between removal of parent and its
descendants.  While we can claim that kernfs_remove() shouldn't be
called on one of the descendants while the removal of an ancestor is
in progress, such a rule is unnecessarily restrictive and very difficult
to enforce.  It's better to simply allow invoking kernfs_remove() as
the caller sees fit as long as the caller ensures that the node is
accessible.

The current behavior in such situations is broken.  Whoever enters
the removal path first takes the node off the hierarchy and then
deactivates.  Following removers either return as soon as they notice
that they're not the first one or can't even find the target node as it
has already been removed from the hierarchy.  In both cases, the
following removers may finish prematurely while the nodes which should
be removed and drained are still being processed by the first one.

This patch restructures the removal path so that multiple removers,
whether through recursion or direct invocation, always follow these
rules.

* When there are multiple concurrent removers, only one puts the base
  ref.

* Regardless of which one puts the base ref, all removers are blocked
  until the target node is fully deactivated and removed.

To achieve the above, the removal path now first marks all descendants
including self REMOVED and then deactivates and unlinks the leftmost
descendants one-by-one.  kernfs_deactivate() is called directly from
__kernfs_remove() and drops and regrabs kernfs_mutex for each
descendant to drain active refs.  As this means that multiple removers
can enter kernfs_deactivate() for the same node, the function is
updated so that it can handle multiple deactivators of the same node -
only one actually deactivates but all wait till drain completion.

The restructured removal path guarantees that a removed node gets
unlinked only after the node is deactivated and drained.  Combined
with proper multiple deactivator handling, this guarantees that any
invocation of kernfs_remove() returns only after the node itself and
all its descendants are deactivated, drained and removed.

v2: Draining separated into a separate loop (used to be in the same
    loop as unlink) and done from __kernfs_deactivate().  This is to
    allow exposing deactivation as a separate interface later.

    Root node removal was broken in v1 patch.  Fixed.

v3: Revert most of v2 except for root node removal fix and
    simplification of KERNFS_REMOVED setting loop.

v4: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
    KERNFS_LOCKDEP flag").

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:40 -08:00
Tejun Heo abd54f028e kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq
kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate().  We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal.  The removal path will be restructured to address the issue.

To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq.  This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.

v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
    KERNFS_LOCKDEP flag").

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:40 -08:00
Tejun Heo a6607930b6 kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag
kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set
before performing lockdep annotations and ends up feeding an
uninitialized lockdep_map to lockdep, triggering a warning like the
following on USB stick hot-unplug.

 usb 1-2: USB disconnect, device number 2
 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.
 CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002
  ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002
  ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710
 Call Trace:
  [<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a
  [<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70
  [<ffffffff810f931a>] lock_acquire+0x9a/0x1d0
  [<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130
  [<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60
  [<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0
  [<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80
  [<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0
  [<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50
  [<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80
  [<ffffffff8177e25e>] device_del+0x12e/0x1d0
  [<ffffffff819722c2>] usb_disconnect+0x122/0x1a0
  [<ffffffff819749b5>] hub_thread+0x3c5/0x1290
  [<ffffffff810c6a6d>] kthread+0xed/0x110
  [<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0

Fix it by making kernfs_deactivate() perform lockdep annotations only
if KERNFS_LOCKDEP is set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:42:40 -08:00
Colin Cross fee0c54e28 dma-buf: avoid using IS_ERR_OR_NULL
dma_buf_map_attachment and dma_buf_vmap can return NULL or
ERR_PTR on error.  This encourages a common buggy pattern in
callers:
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR_OR_NULL(sgt))
                return PTR_ERR(sgt);

This causes the caller to return 0 on an error.  IS_ERR_OR_NULL
is almost always a sign of poorly-defined error handling.

This patch converts dma_buf_map_attachment to always return
ERR_PTR, and fixes the callers that incorrectly handled NULL.
There are a few more callers that were not checking for NULL
at all, which would have dereferenced a NULL pointer later.
There are also a few more callers that correctly handled NULL
and ERR_PTR differently, I left those alone but they could also
be modified to delete the NULL check.

This patch also converts dma_buf_vmap to always return NULL.
All the callers to dma_buf_vmap only check for NULL, and would
have dereferenced an ERR_PTR and panic'd if one was ever
returned. This is not consistent with the rest of the dma buf
APIs, but matches the expectations of all of the callers.

Signed-off-by: Colin Cross <ccross@android.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 14:21:09 -08:00
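
With the conversion above, the correct caller pattern becomes a plain IS_ERR() check (a sketch of a caller, not code from the patch):

  #include <linux/dma-buf.h>
  #include <linux/err.h>

  static int example_map(struct dma_buf_attachment *attach)
  {
      struct sg_table *sgt;

      sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
      if (IS_ERR(sgt))
          return PTR_ERR(sgt);    /* a real errno, never 0 */

      /* ... use the mapping ... */

      dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
      return 0;
  }
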
Hugh Dickins ab3f5faa62 cgroup: use an ordered workqueue for cgroup destruction
Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Just use an ordered workqueue for cgroup_destroy_wq.

tj: Committing as the temporary fix until the reverse dependency can
    be removed from memcg.  Comment updated accordingly.

Fixes: e5fca243ab ("cgroup: use a dedicated workqueue for cgroup destruction")
Suggested-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-07 10:21:12 -05:00
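
For reference, the change amounts to allocating the destruction workqueue as an ordered one; a minimal sketch of the call (the real allocation sits in cgroup's init path):

  #include <linux/workqueue.h>
  #include <linux/errno.h>
  #include <linux/init.h>

  static struct workqueue_struct *example_destroy_wq;

  static int __init example_wq_init(void)
  {
      /*
       * An ordered workqueue executes at most one work item at a time, in
       * queueing order, so a parent's offline work can no longer run ahead
       * of (and deadlock against) its child's.
       */
      example_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
      if (!example_destroy_wq)
          return -ENOMEM;
      return 0;
  }
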
Tejun Heo 0a6be65553 nfs: include xattr.h from fs/nfs/nfs3proc.c
fs/nfs/nfs3proc.c is making use of xattr but was getting linux/xattr.h
indirectly through linux/cgroup.h, which will soon drop the inclusion
of xattr.h.  Explicitly include linux/xattr.h from nfs3proc.c so that
compilation doesn't fail when linux/cgroup.h drops linux/xattr.h.

As the following cgroup changes will depend on these changes, it
probably would be easier to route this through cgroup branch.  Would
that be okay?

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
2014-02-03 15:43:59 -05:00
Li Zefan 230579d7e0 cpuset: update MAINTAINERS entry
Add mailing list and tree tag to the entry.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-03 13:24:05 -05:00
Tejun Heo 1ff6bbfd13 arm, pm, vmpressure: add missing slab.h includes
arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c
were somehow getting slab.h indirectly through cgroup.h which in turn
was getting it indirectly through xattr.h.  A scheduled cgroup change
drops xattr.h inclusion from cgroup.h and breaks compilation of these
three files.  Add explicit slab.h includes to the three files.

A pending cgroup patch depends on this change and it'd be great if
this can be routed through the cgroup/for-3.14-fixes branch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: linux-tegra@vger.kernel.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: cgroups@vger.kernel.org
2014-02-03 13:24:01 -05:00
Linus Torvalds 38dbfb59d1 Linus 3.14-rc1 2014-02-02 16:42:13 -08:00
Linus Torvalds 69048e0188 Merge branch 'parisc-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
 "The three major changes in this patchset is a implementation for
  flexible userspace memory maps, cache-flushing fixes (again), and a
  long-discussed ABI change to make EWOULDBLOCK the same value as
  EAGAIN.

  parisc has been the only platform where we had EWOULDBLOCK != EAGAIN
  to keep HP-UX compatibility.  Since we will probably never implement
  full HP-UX support, we prefer to drop this compatibility to make it
  easier for us with Linux userspace programs which mostly never checked
  for both values.  We don't expect major fall-outs because of this
  change, and if we face some, we will simply rebuild the necessary
  applications in the debian archives"

* 'parisc-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: add flexible mmap memory layout support
  parisc: Make EWOULDBLOCK be equal to EAGAIN on parisc
  parisc: convert uapi/asm/stat.h to use native types only
  parisc: wire up sched_setattr and sched_getattr
  parisc: fix cache-flushing
  parisc/sti_console: prefer Linux fonts over built-in ROM fonts
2014-02-02 16:32:53 -08:00
Mikulas Patocka 1c0b8a7a62 hpfs: optimize quad buffer loading
HPFS needs to load 4 consecutive 512-byte sectors when accessing the
directory nodes or bitmaps.  We can't switch to a 2048-byte block size
because files are allocated in units of 512-byte sectors.

Previously, the driver would allocate a 2048-byte area using kmalloc,
copy the data from four buffers to this area and eventually copy them
back if they were modified.

In the current implementation of the buffer cache, buffers are allocated
in the pagecache.  That means that 4 consecutive 512-byte buffers are
stored in consecutive areas in the kernel address space.  So, we don't
need to allocate extra memory and copy the content of the buffers there.

This patch optimizes the code to avoid copying the buffers.  It checks
if the four buffers are stored in contiguous memory - if they are not,
it falls back to allocating a 2048-byte area and copying data there.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-02 16:24:07 -08:00
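
The contiguity test this optimization hinges on can be sketched as follows (a simplified illustration, not the actual hpfs code): if the four 512-byte buffers happen to sit back to back in the page cache, their data can be used in place; otherwise the old copy-into-a-2048-byte-area path is taken.

  #include <linux/buffer_head.h>
  #include <linux/types.h>

  static bool example_quad_is_contiguous(struct buffer_head *bh[4])
  {
      return bh[1]->b_data == bh[0]->b_data + 512 &&
             bh[2]->b_data == bh[0]->b_data + 2 * 512 &&
             bh[3]->b_data == bh[0]->b_data + 3 * 512;
  }
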
Mikulas Patocka 2cbe5c76fc hpfs: remember free space
Previously, hpfs scanned all bitmaps each time the user asked for free
space using statfs.  This patch changes it so that hpfs scans the
bitmaps only once, remembers the free space, and on the next invocation
of statfs returns the value instantly.

New versions of wine are hammering on the statfs syscall very heavily,
making some games unplayable when they're stored on hpfs, with load
times in minutes.

This should be backported to the stable kernels because it fixes a
user-visible problem (excessive level load times in wine).

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-02 16:24:07 -08:00
Helge Deller 9dabf60dc4 parisc: add flexible mmap memory layout support
Add support for the flexible mmap memory layout (as described in
http://lwn.net/Articles/91829). This is especially interesting on
parisc since we currently only support 32bit userspace (even with a
64bit Linux kernel).

Signed-off-by: Helge Deller <deller@gmx.de>
2014-02-02 21:00:13 +01:00
Guy Martin f5a408d53e parisc: Make EWOULDBLOCK be equal to EAGAIN on parisc
On Linux, only parisc uses a different value for EWOULDBLOCK which
causes a lot of trouble for applications not checking for both values.
Since the HP-UX compat is long dead, make EWOULDBLOCK behave the same
as on all other architectures.

Signed-off-by: Guy Martin <gmsoft@tuxicoman.be>
Signed-off-by: Helge Deller <deller@gmx.de>
2014-02-02 20:57:42 +01:00
Helge Deller 9391bc777b parisc: convert uapi/asm/stat.h to use native types only
The stat.h header file is exported to userspace. Some userspace
applications failed to compile due to missing/unknown types, so we'd
better convert it to use native types only (like it's done on other
architectures too).

Signed-off-by: Helge Deller <deller@gmx.de>
2014-02-02 20:57:33 +01:00
Helge Deller 998bbb2fc0 parisc: wire up sched_setattr and sched_getattr
Signed-off-by: Helge Deller <deller@gmx.de>
2014-02-02 20:57:25 +01:00
Helge Deller 57737c49dd parisc: fix cache-flushing
Commit f8dae00684d678afa13041ef170cecfd1297ed40 ("parisc: Ensure full cache
coherency for kmap/kunmap") caused negative caching side-effects, e.g. hanging
processes with expect and too many inequivalent alias messages from
flush_dcache_page() on Debian 5 systems.

This patch now partly reverts it and has been in production use on our Debian
buildd make servers for a week without any major problems.

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Helge Deller <deller@gmx.de>
2014-02-02 20:57:16 +01:00