Commit Graph

441299 Commits

Author SHA1 Message Date
Tejun Heo f3d4650015 cgroup: convert cgroup_has_live_children() into css_has_online_children()
Now that cgroup liveliness and css onliness are the same state,
convert cgroup_has_live_children() into css_has_online_children() so
that it can be used for actual csses too.  The function now uses
css_for_each_child() for iteration and is published.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:52 -04:00
Tejun Heo 184faf3232 cgroup: use CSS_ONLINE instead of CGRP_DEAD
Use CSS_ONLINE on the self css to indicate whether a cgroup has been
killed instead of CGRP_DEAD.  This will allow re-using css online test
for cgroup liveliness test.  This doesn't introduce any functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:51 -04:00
Tejun Heo c2931b70a3 cgroup: iterate cgroup_subsys_states directly
Currently, css_next_child() is implemented as finding the next child
cgroup which has the css enabled, which used to be the only way to do
it as only cgroups participated in sibling lists and thus could be
iteratd.  This works as long as what's required during iteration is
not missing online csses; however, it turns out that there are use
cases where offlined but not yet released csses need to be iterated.
This is difficult to implement through cgroup iteration the unified
hierarchy as there may be multiple dying csses for the same subsystem
associated with single cgroup.

After the recent changes, the cgroup self and regular csses behave
identically in how they're linked and unlinked from the sibling lists
including assertion of CSS_RELEASED and css_next_child() can simply
switch to iterating csses directly.  This both simplifies the logic
and ensures that all visible non-released csses are included in the
iteration whether there are multiple dying csses for a subsystem or
not.

As all other iterators depend on css_next_child() for sibling
iteration, this changes behaviors of all css iterators.  Add and
update explanations on the css states which are included in traversal
to all iterators.

As css iteration could always contain offlined csses, this shouldn't
break any of the current users and new usages which need iteration of
all on and offline csses can make use of the new semantics.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16 13:22:51 -04:00
Tejun Heo de3f034182 cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
css iterations allow the caller to drop RCU read lock.  As long as the
caller keeps the current position accessible, it can simply re-grab
RCU read lock later and continue iteration.  This is achieved by using
CGRP_DEAD to detect whether the current positions next pointer is safe
to dereference and if not re-iterate from the beginning to the next
position using ->serial_nr.

CGRP_DEAD is used as the marker to invalidate the next pointer and the
only requirement is that the marker is set before the next sibling
starts its RCU grace period.  Because CGRP_DEAD is set at the end of
cgroup_destroy_locked() but the cgroup is unlinked when the reference
count reaches zero, we currently have a rather large window where this
fallback re-iteration logic can be triggered.

This patch introduces CSS_RELEASED which is set when a css is unlinked
from its sibling list.  This still keeps the re-iteration logic
working while drastically reducing the window of its activation.
While at it, rewrite the comment in css_next_child() to reflect the
new flag and better explain the synchronization.

This will also enable iterating csses directly instead of through
cgroups.

v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by
    CSS_NO_REF.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:49 -04:00
Tejun Heo 0cb51d71c1 cgroup: move cgroup->serial_nr into cgroup_subsys_state
We're moving towards using cgroup_subsys_states as the fundamental
structural blocks.  All csses including the cgroup->self and actual
ones now form trees through css->children and ->sibling which follow
the same rules as what cgroup->children and ->sibling followed.  This
patch moves cgroup->serial_nr which is used to implement css iteration
into css.

Note that all csses, regardless of their types, allocate their serial
numbers from the same monotonically increasing counter.  This doesn't
affect the ordering needed by css iteration or cause any other
material behavior changes.  This will be used to update css iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:49 -04:00
Tejun Heo 1fed1b2e36 cgroup: link all cgroup_subsys_states in their sibling lists
Currently, while all csses have ->children and ->sibling, only the
self csses of cgroups make use of them.  This patch makes all other
csses to link themselves on the sibling lists too.  This will be used
to update css iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:49 -04:00
Tejun Heo d5c419b68e cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
We're moving towards using cgroup_subsys_states as the fundamental
structural blocks.  Let's move cgroup->sibling and ->children into
cgroup_subsys_state.  This is pure move without functional change and
only cgroup->self's fields are actually used.  Other csses will make
use of the fields later.

While at it, update init_and_link_css() so that it zeroes the whole
css before initializing it and remove explicit zeroing of ->flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:48 -04:00
Tejun Heo d51f39b05c cgroup: remove cgroup->parent
cgroup->parent is redundant as cgroup->self.parent can also be used to
determine the parent cgroup and we're moving towards using
cgroup_subsys_states as the fundamental structural blocks.  This patch
introduces cgroup_parent() which follows cgroup->self.parent and
removes cgroup->parent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:48 -04:00
Tejun Heo 5877019d97 device_cgroup: remove direct access to cgroup->children
Currently, devcg::has_children() directly tests cgroup->children for
list emptiness.  The field is not a published field and scheduled to
go away.  In addition, the test isn't strictly correct as devcg should
only care about children which are visible to userland.

This patch converts has_children() to use css_next_child() instead.
The subtle incorrectness is noted and will be dealt with later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16 13:22:48 -04:00
Tejun Heo ea280e7b40 memcg: update memcg_has_children() to use css_next_child()
Currently, memcg_has_children() and mem_cgroup_hierarchy_write()
directly test cgroup->children for list emptiness.  It's semantically
correct in traditional hierarchies as it actually wants to test for
any children dead or alive; however, cgroup->children is not a
published field and scheduled to go away.

This patch moves out .use_hierarchy test out of memcg_has_children()
and updates it to use css_next_child() to test whether there exists
any children.  With .use_hierarchy test moved out, it can also be used
by mem_cgroup_hierarchy_write().

A side note: As .use_hierarchy is going away, it doesn't really matter
but I'm not sure about how it's used in __memcg_activate_kmem().  The
condition tested by memcg_has_children() is mushy when seen from
userland as its result is affected by dead csses which aren't visible
from userland.  I think the rule would be a lot clearer if we have a
dedicated "freshly minted" flag which gets cleared when the first task
is migrated into it or the first child is created and then gate
activation with that.

v2: Added comment noting that testing use_hierarchy is the
    responsibility of the callers of memcg_has_children() as suggested
    by Michal Hocko.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16 13:22:48 -04:00
Michal Hocko f61c42a7d9 memcg: remove tasks/children test from mem_cgroup_force_empty()
Tejun has correctly pointed out that tasks/children test in
mem_cgroup_force_empty is not correct because there is no other locking
which preserves this state throughout the rest of the function so both
new tasks can join the group or new children groups can be added while
somebody is writing to memory.force_empty. A new task would break
mem_cgroup_reparent_charges expectation that all failures as described
by mem_cgroup_force_empty_list are temporal and there is no way out.

The main use case for the knob as described by
Documentation/cgroups/memory.txt is to:
"
  The typical use case for this interface is before calling rmdir().
  Because rmdir() moves all pages to parent, some out-of-use page caches can be
  moved to the parent. If you want to avoid that, force_empty will be useful.
"

This means that reparenting is not really required as rmdir will
reparent pages implicitly from the safe context. If we remove it from
mem_cgroup_force_empty then we are safe even with existing tasks because
the number of reclaim attempts is bounded. Moreover the knob still does
what the documentation claims (modulo reparenting which doesn't make any
difference) and users might expect. Longterm we want to deprecate the
whole knob and put the reparented pages to the tail of parent LRU during
cgroup removal.

tj: Removed unused variable @cgrp from mem_cgroup_force_empty()

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-16 13:22:48 -04:00
Tejun Heo 5c9d535b89 cgroup: remove css_parent()
cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent.  It was quite some
time ago and we're moving forward with making css more prominent.

This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent.  While at it, explicitly mark fields of css
which are public and immutable.

v2: New usage from device_cgroup.c converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16 13:22:48 -04:00
Tejun Heo 3b514d24e2 cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
9395a45004 ("cgroup: enable refcnting for root csses") enabled
reference counting for root csses (cgroup_subsys_states) so that
cgroup's self csses can be used to manage the lifetime of the
containing cgroups.

Unfortunately, this change was incorrect.  During early init,
cgrp_dfl_root self css refcnt is used.  percpu_ref can't initialized
during early init and its initialization is deferred till
cgroup_init() time.  This means that cpu was using percpu_ref which
wasn't properly initialized.  Due to the way percpu variables are laid
out on x86, this didn't blow up immediately on x86 but ended up
incrementing and decrementing the percpu variable at offset zero,
whatever it may be; however, on other archs, this caused fault and
early boot failure.

As cgroup self csses for root cgroups of non-dfl hierarchies need
working refcounting, we can't revert 9395a45004.  This patch adds
CSS_NO_REF which explicitly inhibits reference counting on the css and
sets it on all normal (non-self) csses and cgroup_dfl_root self css.

v2: cgrp_dfl_root.self is the offending one.  Set the flag on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Warren <swarren@nvidia.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Fixes: 9395a45004 ("cgroup: enable refcnting for root csses")
2014-05-16 13:22:47 -04:00
Tejun Heo 9d755d33f0 cgroup: use cgroup->self.refcnt for cgroup refcnting
Currently cgroup implements refcnting separately using atomic_t
cgroup->refcnt.  The destruction paths of cgroup and css are rather
complex and bear a lot of similiarities including the use of RCU and
bouncing to a work item.

This patch makes cgroup use the refcnt of self css for refcnting
instead of using its own.  This makes cgroup refcnting use css's
percpu refcnt and share the destruction mechanism.

* css_release_work_fn() and css_free_work_fn() are updated to handle
  both csses and cgroups.  This is a bit messy but should do until we
  can make cgroup->self a full css, which currently can't be done
  thanks to multiple hierarchies.

* cgroup_destroy_locked() now performs
  percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp).

* Negative refcnt sanity check in cgroup_get() is no longer necessary
  as percpu_ref already handles it.

* Similarly, as a cgroup which hasn't been killed will never be
  released regardless of its refcnt value and percpu_ref has sanity
  check on kill, cgroup_is_dead() sanity check in cgroup_put() is no
  longer necessary.

* As whether a refcnt reached zero or not can only be decided after
  the reference count is killed, cgroup_root->cgrp's refcnting can no
  longer be used to decide whether to kill the root or not.  Let's
  make cgroup_kill_sb() explicitly initiate destruction if the root
  doesn't have any children.  This makes sense anyway as unmounted
  cgroup hierarchy without any children should be destroyed.

While this is a bit messy, this will allow pushing more bookkeeping
towards cgroup->self and thus handling cgroups and csses in more
uniform way.  In the very long term, it should be possible to
introduce a base subsystem and convert the self css to a proper one
making things whole lot simpler and unified.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo 9395a45004 cgroup: enable refcnting for root csses
Currently, css_get(), css_tryget() and css_tryget_online() are noops
for root csses as an optimization; however, we're planning to use css
refcnts to track of cgroup lifetime too and root cgroups also need to
be reference counted.  Since css has been converted to percpu_refcnt,
the overhead of refcnting is miniscule and this optimization isn't too
meaningful anymore.  Furthermore, controllers which optimize the root
cgroup often never even invoke these functions in their hot paths.

This patch enables refcnting for root csses too.  This makes CSS_ROOT
flag unused and removes it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo 25e15d8350 cgroup: bounce css release through css->destroy_work
css release is planned to do more and would require process context.
Bounce it through css->destroy_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo 249f3468a2 cgroup: remove cgroup_destory_css_killed()
cgroup_destroy_css_killed() is cgroup destruction stage which happens
after all csses are offlined.  After the recent updates, it no longer
does anything other than putting the base reference.  This patch
removes the function and makes cgroup_destroy_locked() put the base
ref at the end isntead.

This also makes cgroup->nr_css unnecessary.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo 4e4e284723 cgroup: move cgroup->sibling unlinking to cgroup_put()
Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
cgroup_put().  This is later but still before the RCU grace period, so
it doesn't break css_next_child() although there now is a larger
window in which a dead cgroup is visible during css iteration.  As css
iteration always could have included offline csses, this doesn't
affect correctness; however, it does make css_next_child() fall back
to reiterting mode more often.  This also makes cgroup_put() directly
take cgroup_mutex, which limits where it can be called from.  These
are not immediately problematic and will be dealt with later.

This change enables simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo 9e4173e1f2 cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked()
Currently, check_for_release() on the parent of a destroyed cgroup is
invoked from cgroup_destroy_css_killed().  This is because this is
where the destroyed cgroup can be removed from the parent's children
list.  check_for_release() tests the emptiness of the list directly,
so invoking it before removing the cgroup from the list makes it think
that the parent still has children even when it no longer does.

This patch updates check_for_release() to use
cgroup_has_live_children() instead of directly testing ->children
emptiness and moves check_for_release(parent) earlier to the end of
cgroup_destroy_locked().  As cgroup_has_live_children() ignores
cgroups marked DEAD, check_for_release() functions correctly as long
as it's called after asserting DEAD.

This makes release notification slightly more timely and more
importantly enables further simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo cbc125efad cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked()
We're expecting another user.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo 9d800df12d cgroup: rename cgroup->dummy_css to ->self and move it to the top
cgroup->dummy_css is used as the placeholder css when performing css
oriended operations on the cgroup.  We're gonna shift more cgroup
management to this css.  Let's rename it to ->self and move it to the
top.

This is pure rename and field relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:00 -04:00
Tejun Heo a015edd26e cgroup: use restart_syscall() for mount retries
cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
which is being destroyed.  The retry currently loops inside
cgroup_mount() proper.  This patch makes it return with
restart_syscall() instead so that retry travels out to userland
boundary.

This slightly simplifies the logic and more importantly makes the
retry logic behave better when the wait for some reason becomes
lengthy or infinite by allowing the operation to be suspended or
terminated from userland.

v2: The original patch forgot to free memory allocated for @opts.
    Fixed.  Caught by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:00 -04:00
Tejun Heo 8353da1f91 cgroup: remove cgroup_tree_mutex
cgroup_tree_mutex was introduced to work around the circular
dependency between cgroup_mutex and kernfs active protection - some
kernfs file and directory operations needed cgroup_mutex putting
cgroup_mutex under active protection but cgroup also needs to be able
to access cgroup hierarchies and cftypes to determine which
kernfs_nodes need to be removed.  cgroup_tree_mutex nested above both
cgroup_mutex and kernfs active protection and used to protect the
hierarchy and cftypes.  While this worked, it added a lot of double
lockings and was generally cumbersome.

kernfs provides a mechanism to opt out of active protection and cgroup
was already using it for removal and subtree_control.  There's no
reason to mix both methods of avoiding circular locking dependency and
the preceding cgroup_kn_lock_live() changes applied it to all relevant
cgroup kernfs operations making it unnecessary to nest cgroup_mutex
under kernfs active protection.  The previous patch reversed the
original lock ordering and put cgroup_mutex above kernfs active
protection.

After these changes, all cgroup_tree_mutex usages are now accompanied
by cgroup_mutex making the former completely redundant.  This patch
removes cgroup_tree_mutex and all its usages.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo 01f6474ce0 cgroup: nest kernfs active protection under cgroup_mutex
After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
longer nested below kernfs active protection.  The two don't have any
relationship now.

This patch nests kernfs active protection under cgroup_mutex.  All
cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
temporary cgroup_mutex releases over kernfs operations are removed,
and cgroup_add/rm_cftypes() grab both mutexes.

This makes cgroup_tree_mutex redundant, which will be removed by the
next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo e76ecaeef6 cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods
Make __cgroup_procs_write() and cgroup_release_agent_write() use
cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
cgroup_lock_live_group().  This puts the operations under both
cgroup_tree_mutex and cgroup_mutex protection without circular
dependency from kernfs active protection.  Also, this means that
cgroup_mutex is no longer nested below kernfs active protection.
There is no longer any place where the two locks interact.

This leaves cgroup_lock_live_group() without any user.  Removed.

This will help simplifying cgroup locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo a9746d8da7 cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock()
cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
share the logic to break active protection so that they can grab
cgroup_tree_mutex which nests above active protection and/or remove
self.  Factor out this logic into cgroup_kn_lock_live() and
cgroup_kn_unlock().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo cfc79d5bec cgroup: move cgroup->kn->priv clearing to cgroup_rmdir()
The ->priv field of a cgroup directory kernfs_node points back to the
cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
non-kernfs accesses from css_tryget_from_dir() and
cgroupstats_build().

As these are only applicable to cgroups which finished creation
successfully and fully initialized cgroups are always removed by
cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().

This will help simplifying cgroup locking and shouldn't introduce any
behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo ddab2b6e0e cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write()
Move cgroup_lock_live_group() invocation upwards to right below
cgroup_tree_mutex in cgroup_subtree_control_write().  This is to help
the planned locking simplification.

This doesn't make any userland-visible behavioral changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo b3bfd983ca cgroup: collapse cgroup_create() into croup_mkdir()
cgroup_mkdir() is the sole user of cgroup_create().  Let's collapse
the latter into the former.  This will help simplifying locking.
While at it, remove now stale comment about inode locking.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo ba0f4d7615 cgroup: reorganize cgroup_create()
Reorganize cgroup_create() so that all paths share unlock out path.

* All err_* labels are renamed to out_* as they're now shared by both
  success and failure paths.

* @err renamed to @ret for the similar reason as above and so that
  it's more consistent with other functions.

* cgroup memory allocation moved after locking so that freeing failed
  cgroup happens before unlocking.  While this moves more code inside
  critical section, memory allocations inside cgroup locking are
  already pretty common and this is unlikely to make any noticeable
  difference.

* While at it, replace a stray @parent->root dereference with @root.

This reorganization will help simplifying locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo b7fc5ad235 cgroup: remove cgroup->control_kn
Now that cgroup_subtree_control_write() has access to the associated
kernfs_open_file and thus the kernfs_node, there's no need to cache it
in cgroup->control_kn on creation.  Remove cgroup->control_kn and use
@of->kn directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:22 -04:00
Tejun Heo acbef755f4 cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write()
cgroup_tasks_write() and cgroup_procs_write() are currently using
cftype->write_u64().  This patch converts them to use cftype->write()
instead.  This allows access to the associated kernfs_open_file which
will be necessary to implement the planned kernfs active protection
manipulation for these files.

This shifts buffer parsing to attach_task_by_pid() and makes it return
@nbytes on success.  Let's rename it to __cgroup_procs_write() to
clearly indicate that this is a write handler implementation.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:22 -04:00
Tejun Heo 6770c64e5c cgroup: replace cftype->trigger() with cftype->write()
cftype->trigger() is pointless.  It's trivial to ignore the input
buffer from a regular ->write() operation.  Convert all ->trigger()
users to ->write() and remove ->trigger().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
2014-05-13 12:16:21 -04:00
Tejun Heo 451af504df cgroup: replace cftype->write_string() with cftype->write()
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts.  The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors
  respectively instead of being specified as arguments.

* Should return @nbytes on success instead of 0.

* @buf is not trimmed automatically.  Trim if necessary.  Note that
  blkcg and netprio don't need this as the parsers already handle
  whitespaces.

cftype->write_string() has no user left after the conversions and
removed.

While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion.  Converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
2014-05-13 12:16:21 -04:00
Tejun Heo b41686401e cgroup: implement cftype->write()
During the recent conversion to kernfs, cftype's seq_file operations
are updated so that they are directly mapped to kernfs operations and
thus can fully access the associated kernfs and cgroup contexts;
however, write path hasn't seen similar updates and none of the
existing write operations has access to, for example, the associated
kernfs_open_file.

Let's introduce a new operation cftype->write() which maps directly to
the kernfs write operation and has access to all the arguments and
contexts.  This will replace ->write_string() and ->trigger() and ease
manipulation of kernfs active protection from cgroup file operations.

Two accessors - of_cft() and of_css() - are introduced to enable
accessing the associated cgroup context from cftype->write() which
only takes kernfs_open_file for the context information.  The
accessors for seq_file operations - seq_cft() and seq_css() - are
rewritten to wrap the of_ accessors.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:21 -04:00
Tejun Heo ec903c0c85 cgroup: rename css_tryget*() to css_tryget_online*()
Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero.  cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero.  Let's rename the existing trygets so that they clearly
indicate that they're onliness.

I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.

Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir().  This is pure rename.

v2: cgroup_freezer grew new usages of css_tryget().  Update
    accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2014-05-13 12:11:01 -04:00
Tejun Heo 46cfeb043b cgroup: use release_agent_path_lock in cgroup_release_agent_show()
release_path is now protected by release_agent_path_lock to allow
accessing it without grabbing cgroup_mutex; however,
cgroup_release_agent_show() was still grabbing cgroup_mutex.  Let's
convert it to release_agent_path_lock so that we don't have to worry
about this one for the planned locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:11:00 -04:00
Tejun Heo 7d331fa985 cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write()
After waiting for a child to finish offline,
cgroup_subtree_control_write() jumps up to retry from after the input
parsing and active protection breaking.  This retry makes the
scheduled locking update - removal of cgroup_tree_mutex - more
difficult.  Let's simplify it by returning with restart_syscall() for
retries.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:11:00 -04:00
Tejun Heo d37167ab7b cgroup: update and fix parsing of "cgroup.subtree_control"
I was confused that strsep() was equivalent to strtok_r() in skipping
over consecutive delimiters.  strsep() just splits at the first
occurrence of one of the delimiters which makes the parsing very
inflexible, which makes allowing multiple whitespace chars as
delimters kinda moot.  Let's just be consistently strict and require
list of tokens separated by spaces.  This is what
Documentation/cgroups/unified-hierarchy.txt describes too.

Also, parsing may access beyond the end of the string if the string
ends with spaces or is zero-length.  Make sure it skips zero-length
tokens.  Note that this also ensures that the parser doesn't puke on
multiple consecutive spaces.

v2: Add zero-length token skipping.

v3: Added missing space after "==".  Spotted by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo 0ab7a60dea cgroup: css_release() shouldn't clear cgroup->subsys[]
c1a71504e9 ("cgroup: don't recycle cgroup id until all csses' have
been destroyed") made cgroup ID persist until a cgroup is released and
add cgroup->subsys[] clearing to css_release() so that css_from_id()
doesn't return a css which has already been released which happens
before cgroup release; however, the right change here was updating
offline_css() to clear cgroup->subsys[] which was done by e329780310
("cgroup: cgroup->subsys[] should be cleared after the css is
offlined") instead of clearing it from css_release().

We're now clearing cgroup->subsys[] twice.  This is okay for
traditional hierarchies as a css's lifetime is the same as its
cgroup's; however, this confuses unified hierarchy and turning on and
off a controller repeatedly using "cgroup.subtree_control" can lead to
an oops like the following which happens because cgroup->subsys[] is
incorrectly cleared asynchronously by css_release().

 BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08
 IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
 PGD 1170d067 PUD f0ab067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
 RIP: 0010:[<ffffffff81130c11>]  [<ffffffff81130c11>] kill_css+0x21/0x1c0
 RSP: 0018:ffff88000e199dc8  EFLAGS: 00010202
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
 RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
 RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
 R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
 R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
 FS:  00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
 Stack:
  ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
  ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
  ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
 Call Trace:
  [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
  [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
  [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
  [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
  [<ffffffff811f35ad>] SyS_write+0x4d/0xc0
  [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
 Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
 RIP  [<ffffffff81130c11>] kill_css+0x21/0x1c0
  RSP <ffff88000e199dc8>
 CR2: 0000000000000008
 ---[ end trace e7aae1f877c4e1b4 ]---

Remove the unnecessary cgroup->subsys[] clearing from css_release().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo 54504e977c cgroup: cgroup_idr_lock should be bh
cgroup_idr_remove() can be invoked from bh leading to lockdep
detecting possible AA deadlock (IN_BH/ON_BH).  Make the lock bh-safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo 0cee8b7786 cgroup: fix offlining child waiting in cgroup_subtree_control_write()
cgroup_subtree_control_write() waits for offline to complete
child-by-child before enabling a controller; however, it has a couple
bugs.

* It doesn't initialize the wait_queue_t.  This can lead to infinite
  hang on the following schedule() among other things.

* It forgets to pin the child before releasing cgroup_tree_mutex and
  performing schedule().  The child may already be gone by the time it
  wakes up and invokes finish_wait().  Pin the child being waited on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo f21a4f7594 Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16
Pull to receive e37a06f109 ("cgroup: fix the retry path of
cgroup_mount()") to avoid unnecessary conflicts with planned
cgroup_tree_mutex removal and also to be able to remove the temp fix
added by 36c38fb714 ("blkcg: use trylock on blkcg_pol_mutex in
blkcg_reset_stats()") afterwards.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-13 11:30:04 -04:00
Tejun Heo 36e9d2ebcc cgroup: fix rcu_read_lock() leak in update_if_frozen()
While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
replace freezer->lock with freezer_mutex") introduced a bug in
update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
by adding rcu_read_unlock() before returning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
2014-05-13 11:28:30 -04:00
Tejun Heo d39ea871c3 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16
Pull to receive percpu_ref_tryget[_live]() changes.  Planned cgroup
changes will make use of them.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-13 11:27:24 -04:00
Tejun Heo e5ced8ebb1 cgroup_freezer: replace freezer->lock with freezer_mutex
After 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators requires sleepable context as
it may block on css_set_rwsem.  I missed that cgroup_freezer was
iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
errors like the following on freezer state reads and transitions.

  BUG: sleeping function called from invalid context at /work
 /os/work/kernel/locking/rwsem.c:20
  in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
  5 locks held by bash/462:
   #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
   #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
   #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
   #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
   #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
  Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460

  CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
   ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
   0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
  Call Trace:
   [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
   [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
   [<ffffffff81d05974>] down_read+0x24/0x60
   [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
   [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
   [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
   [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
   [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
   [<ffffffff811f121d>] SyS_write+0x4d/0xc0
   [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be IRQ-safe spinlock or even
per-cgroup.  In fact, given the fact that a cgroup may contain large
number of tasks, it's not a good idea to iterate over them while
holding IRQ-safe spinlock.

Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex.  This also makes the comments explaining the
intricacies of policy inheritance and the locking around it as the
states are protected by a common mutex.

The conversion is mostly straight-forward.  The followings are worth
mentioning.

* freezer_css_online() no longer needs double locking.

* freezer_attach() now performs propagation simply while holding
  freezer_mutex.  update_if_frozen() race no longer exists and the
  comment is removed.

* freezer_fork() now tests whether the task is in root cgroup using
  the new task_css_is_root() without doing rcu_read_lock/unlock().  If
  not, it grabs freezer_mutex and performs the operation.

* freezer_read() and freezer_change_state() grab freezer_mutex across
  the whole operation and pin the css while iterating so that each
  descendant processing happens in sleepable context.

Fixes: 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 11:26:31 -04:00
Tejun Heo 5024ae29cd cgroup: introduce task_css_is_root()
Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.

Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.

v2: cgroup no longer supports modular controllers.  No need to export
    init_css_set.  Pointed out by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 11:26:27 -04:00
Tejun Heo 4fb6e25049 percpu-refcount: implement percpu_ref_tryget()
Implement percpu_ref_tryget() which fails if the refcnt already
reached zero.  Note that this is different from the recently renamed
percpu_ref_tryget_live() which fails if the refcnt has been killed and
is draining the remaining references.  percpu_ref_tryget() succeeds on
a killed refcnt as long as its current refcnt is above zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Kent Overstreet <kmo@daterainc.com>
2014-05-09 15:42:35 -04:00
Tejun Heo 2070d50e1c percpu-refcount: rename percpu_ref_tryget() to percpu_ref_tryget_live()
percpu_ref_tryget() is different from the usual tryget semantics in
that it fails if the refcnt is in its dying stage even if the refcnt
hasn't reached zero yet.  We're about to introduce the more
conventional tryget and the current one has only one user.  Let's
rename it to percpu_ref_tryget_live() so that it explicitly signifies
the peculiarities of its semantics.

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Kent Overstreet <kmo@daterainc.com>
2014-05-09 15:42:15 -04:00
Tejun Heo 2b53f41fa8 cgroup: remove unused CGRP_SANE_BEHAVIOR
This cgroup flag has never been used.  Only CGRP_ROOT_SANE_BEHAVIOR is
used.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-07 09:21:56 -04:00