Commit Graph

441284 Commits

Author SHA1 Message Date
Tejun Heo 25e15d8350 cgroup: bounce css release through css->destroy_work
css release is planned to do more and would require process context.
Bounce it through css->destroy_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:02 -04:00
Tejun Heo 249f3468a2 cgroup: remove cgroup_destory_css_killed()
cgroup_destroy_css_killed() is cgroup destruction stage which happens
after all csses are offlined.  After the recent updates, it no longer
does anything other than putting the base reference.  This patch
removes the function and makes cgroup_destroy_locked() put the base
ref at the end isntead.

This also makes cgroup->nr_css unnecessary.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo 4e4e284723 cgroup: move cgroup->sibling unlinking to cgroup_put()
Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
cgroup_put().  This is later but still before the RCU grace period, so
it doesn't break css_next_child() although there now is a larger
window in which a dead cgroup is visible during css iteration.  As css
iteration always could have included offline csses, this doesn't
affect correctness; however, it does make css_next_child() fall back
to reiterting mode more often.  This also makes cgroup_put() directly
take cgroup_mutex, which limits where it can be called from.  These
are not immediately problematic and will be dealt with later.

This change enables simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo 9e4173e1f2 cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked()
Currently, check_for_release() on the parent of a destroyed cgroup is
invoked from cgroup_destroy_css_killed().  This is because this is
where the destroyed cgroup can be removed from the parent's children
list.  check_for_release() tests the emptiness of the list directly,
so invoking it before removing the cgroup from the list makes it think
that the parent still has children even when it no longer does.

This patch updates check_for_release() to use
cgroup_has_live_children() instead of directly testing ->children
emptiness and moves check_for_release(parent) earlier to the end of
cgroup_destroy_locked().  As cgroup_has_live_children() ignores
cgroups marked DEAD, check_for_release() functions correctly as long
as it's called after asserting DEAD.

This makes release notification slightly more timely and more
importantly enables further simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo cbc125efad cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked()
We're expecting another user.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:01 -04:00
Tejun Heo 9d800df12d cgroup: rename cgroup->dummy_css to ->self and move it to the top
cgroup->dummy_css is used as the placeholder css when performing css
oriended operations on the cgroup.  We're gonna shift more cgroup
management to this css.  Let's rename it to ->self and move it to the
top.

This is pure rename and field relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:00 -04:00
Tejun Heo a015edd26e cgroup: use restart_syscall() for mount retries
cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
which is being destroyed.  The retry currently loops inside
cgroup_mount() proper.  This patch makes it return with
restart_syscall() instead so that retry travels out to userland
boundary.

This slightly simplifies the logic and more importantly makes the
retry logic behave better when the wait for some reason becomes
lengthy or infinite by allowing the operation to be suspended or
terminated from userland.

v2: The original patch forgot to free memory allocated for @opts.
    Fixed.  Caught by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-14 09:15:00 -04:00
Tejun Heo 8353da1f91 cgroup: remove cgroup_tree_mutex
cgroup_tree_mutex was introduced to work around the circular
dependency between cgroup_mutex and kernfs active protection - some
kernfs file and directory operations needed cgroup_mutex putting
cgroup_mutex under active protection but cgroup also needs to be able
to access cgroup hierarchies and cftypes to determine which
kernfs_nodes need to be removed.  cgroup_tree_mutex nested above both
cgroup_mutex and kernfs active protection and used to protect the
hierarchy and cftypes.  While this worked, it added a lot of double
lockings and was generally cumbersome.

kernfs provides a mechanism to opt out of active protection and cgroup
was already using it for removal and subtree_control.  There's no
reason to mix both methods of avoiding circular locking dependency and
the preceding cgroup_kn_lock_live() changes applied it to all relevant
cgroup kernfs operations making it unnecessary to nest cgroup_mutex
under kernfs active protection.  The previous patch reversed the
original lock ordering and put cgroup_mutex above kernfs active
protection.

After these changes, all cgroup_tree_mutex usages are now accompanied
by cgroup_mutex making the former completely redundant.  This patch
removes cgroup_tree_mutex and all its usages.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo 01f6474ce0 cgroup: nest kernfs active protection under cgroup_mutex
After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
longer nested below kernfs active protection.  The two don't have any
relationship now.

This patch nests kernfs active protection under cgroup_mutex.  All
cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
temporary cgroup_mutex releases over kernfs operations are removed,
and cgroup_add/rm_cftypes() grab both mutexes.

This makes cgroup_tree_mutex redundant, which will be removed by the
next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo e76ecaeef6 cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods
Make __cgroup_procs_write() and cgroup_release_agent_write() use
cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
cgroup_lock_live_group().  This puts the operations under both
cgroup_tree_mutex and cgroup_mutex protection without circular
dependency from kernfs active protection.  Also, this means that
cgroup_mutex is no longer nested below kernfs active protection.
There is no longer any place where the two locks interact.

This leaves cgroup_lock_live_group() without any user.  Removed.

This will help simplifying cgroup locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:23 -04:00
Tejun Heo a9746d8da7 cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock()
cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
share the logic to break active protection so that they can grab
cgroup_tree_mutex which nests above active protection and/or remove
self.  Factor out this logic into cgroup_kn_lock_live() and
cgroup_kn_unlock().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo cfc79d5bec cgroup: move cgroup->kn->priv clearing to cgroup_rmdir()
The ->priv field of a cgroup directory kernfs_node points back to the
cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
non-kernfs accesses from css_tryget_from_dir() and
cgroupstats_build().

As these are only applicable to cgroups which finished creation
successfully and fully initialized cgroups are always removed by
cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().

This will help simplifying cgroup locking and shouldn't introduce any
behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo ddab2b6e0e cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write()
Move cgroup_lock_live_group() invocation upwards to right below
cgroup_tree_mutex in cgroup_subtree_control_write().  This is to help
the planned locking simplification.

This doesn't make any userland-visible behavioral changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo b3bfd983ca cgroup: collapse cgroup_create() into croup_mkdir()
cgroup_mkdir() is the sole user of cgroup_create().  Let's collapse
the latter into the former.  This will help simplifying locking.
While at it, remove now stale comment about inode locking.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo ba0f4d7615 cgroup: reorganize cgroup_create()
Reorganize cgroup_create() so that all paths share unlock out path.

* All err_* labels are renamed to out_* as they're now shared by both
  success and failure paths.

* @err renamed to @ret for the similar reason as above and so that
  it's more consistent with other functions.

* cgroup memory allocation moved after locking so that freeing failed
  cgroup happens before unlocking.  While this moves more code inside
  critical section, memory allocations inside cgroup locking are
  already pretty common and this is unlikely to make any noticeable
  difference.

* While at it, replace a stray @parent->root dereference with @root.

This reorganization will help simplifying locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:19:22 -04:00
Tejun Heo b7fc5ad235 cgroup: remove cgroup->control_kn
Now that cgroup_subtree_control_write() has access to the associated
kernfs_open_file and thus the kernfs_node, there's no need to cache it
in cgroup->control_kn on creation.  Remove cgroup->control_kn and use
@of->kn directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:22 -04:00
Tejun Heo acbef755f4 cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write()
cgroup_tasks_write() and cgroup_procs_write() are currently using
cftype->write_u64().  This patch converts them to use cftype->write()
instead.  This allows access to the associated kernfs_open_file which
will be necessary to implement the planned kernfs active protection
manipulation for these files.

This shifts buffer parsing to attach_task_by_pid() and makes it return
@nbytes on success.  Let's rename it to __cgroup_procs_write() to
clearly indicate that this is a write handler implementation.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:22 -04:00
Tejun Heo 6770c64e5c cgroup: replace cftype->trigger() with cftype->write()
cftype->trigger() is pointless.  It's trivial to ignore the input
buffer from a regular ->write() operation.  Convert all ->trigger()
users to ->write() and remove ->trigger().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
2014-05-13 12:16:21 -04:00
Tejun Heo 451af504df cgroup: replace cftype->write_string() with cftype->write()
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts.  The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors
  respectively instead of being specified as arguments.

* Should return @nbytes on success instead of 0.

* @buf is not trimmed automatically.  Trim if necessary.  Note that
  blkcg and netprio don't need this as the parsers already handle
  whitespaces.

cftype->write_string() has no user left after the conversions and
removed.

While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion.  Converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
2014-05-13 12:16:21 -04:00
Tejun Heo b41686401e cgroup: implement cftype->write()
During the recent conversion to kernfs, cftype's seq_file operations
are updated so that they are directly mapped to kernfs operations and
thus can fully access the associated kernfs and cgroup contexts;
however, write path hasn't seen similar updates and none of the
existing write operations has access to, for example, the associated
kernfs_open_file.

Let's introduce a new operation cftype->write() which maps directly to
the kernfs write operation and has access to all the arguments and
contexts.  This will replace ->write_string() and ->trigger() and ease
manipulation of kernfs active protection from cgroup file operations.

Two accessors - of_cft() and of_css() - are introduced to enable
accessing the associated cgroup context from cftype->write() which
only takes kernfs_open_file for the context information.  The
accessors for seq_file operations - seq_cft() and seq_css() - are
rewritten to wrap the of_ accessors.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:16:21 -04:00
Tejun Heo ec903c0c85 cgroup: rename css_tryget*() to css_tryget_online*()
Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero.  cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero.  Let's rename the existing trygets so that they clearly
indicate that they're onliness.

I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.

Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir().  This is pure rename.

v2: cgroup_freezer grew new usages of css_tryget().  Update
    accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2014-05-13 12:11:01 -04:00
Tejun Heo 46cfeb043b cgroup: use release_agent_path_lock in cgroup_release_agent_show()
release_path is now protected by release_agent_path_lock to allow
accessing it without grabbing cgroup_mutex; however,
cgroup_release_agent_show() was still grabbing cgroup_mutex.  Let's
convert it to release_agent_path_lock so that we don't have to worry
about this one for the planned locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:11:00 -04:00
Tejun Heo 7d331fa985 cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write()
After waiting for a child to finish offline,
cgroup_subtree_control_write() jumps up to retry from after the input
parsing and active protection breaking.  This retry makes the
scheduled locking update - removal of cgroup_tree_mutex - more
difficult.  Let's simplify it by returning with restart_syscall() for
retries.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:11:00 -04:00
Tejun Heo d37167ab7b cgroup: update and fix parsing of "cgroup.subtree_control"
I was confused that strsep() was equivalent to strtok_r() in skipping
over consecutive delimiters.  strsep() just splits at the first
occurrence of one of the delimiters which makes the parsing very
inflexible, which makes allowing multiple whitespace chars as
delimters kinda moot.  Let's just be consistently strict and require
list of tokens separated by spaces.  This is what
Documentation/cgroups/unified-hierarchy.txt describes too.

Also, parsing may access beyond the end of the string if the string
ends with spaces or is zero-length.  Make sure it skips zero-length
tokens.  Note that this also ensures that the parser doesn't puke on
multiple consecutive spaces.

v2: Add zero-length token skipping.

v3: Added missing space after "==".  Spotted by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo 0ab7a60dea cgroup: css_release() shouldn't clear cgroup->subsys[]
c1a71504e9 ("cgroup: don't recycle cgroup id until all csses' have
been destroyed") made cgroup ID persist until a cgroup is released and
add cgroup->subsys[] clearing to css_release() so that css_from_id()
doesn't return a css which has already been released which happens
before cgroup release; however, the right change here was updating
offline_css() to clear cgroup->subsys[] which was done by e329780310
("cgroup: cgroup->subsys[] should be cleared after the css is
offlined") instead of clearing it from css_release().

We're now clearing cgroup->subsys[] twice.  This is okay for
traditional hierarchies as a css's lifetime is the same as its
cgroup's; however, this confuses unified hierarchy and turning on and
off a controller repeatedly using "cgroup.subtree_control" can lead to
an oops like the following which happens because cgroup->subsys[] is
incorrectly cleared asynchronously by css_release().

 BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08
 IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
 PGD 1170d067 PUD f0ab067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
 RIP: 0010:[<ffffffff81130c11>]  [<ffffffff81130c11>] kill_css+0x21/0x1c0
 RSP: 0018:ffff88000e199dc8  EFLAGS: 00010202
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
 RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
 RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
 R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
 R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
 FS:  00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
 Stack:
  ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
  ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
  ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
 Call Trace:
  [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
  [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
  [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
  [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
  [<ffffffff811f35ad>] SyS_write+0x4d/0xc0
  [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
 Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
 RIP  [<ffffffff81130c11>] kill_css+0x21/0x1c0
  RSP <ffff88000e199dc8>
 CR2: 0000000000000008
 ---[ end trace e7aae1f877c4e1b4 ]---

Remove the unnecessary cgroup->subsys[] clearing from css_release().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo 54504e977c cgroup: cgroup_idr_lock should be bh
cgroup_idr_remove() can be invoked from bh leading to lockdep
detecting possible AA deadlock (IN_BH/ON_BH).  Make the lock bh-safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo 0cee8b7786 cgroup: fix offlining child waiting in cgroup_subtree_control_write()
cgroup_subtree_control_write() waits for offline to complete
child-by-child before enabling a controller; however, it has a couple
bugs.

* It doesn't initialize the wait_queue_t.  This can lead to infinite
  hang on the following schedule() among other things.

* It forgets to pin the child before releasing cgroup_tree_mutex and
  performing schedule().  The child may already be gone by the time it
  wakes up and invokes finish_wait().  Pin the child being waited on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 12:10:59 -04:00
Tejun Heo f21a4f7594 Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16
Pull to receive e37a06f109 ("cgroup: fix the retry path of
cgroup_mount()") to avoid unnecessary conflicts with planned
cgroup_tree_mutex removal and also to be able to remove the temp fix
added by 36c38fb714 ("blkcg: use trylock on blkcg_pol_mutex in
blkcg_reset_stats()") afterwards.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-13 11:30:04 -04:00
Tejun Heo 36e9d2ebcc cgroup: fix rcu_read_lock() leak in update_if_frozen()
While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
replace freezer->lock with freezer_mutex") introduced a bug in
update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
by adding rcu_read_unlock() before returning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
2014-05-13 11:28:30 -04:00
Tejun Heo d39ea871c3 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16
Pull to receive percpu_ref_tryget[_live]() changes.  Planned cgroup
changes will make use of them.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-13 11:27:24 -04:00
Tejun Heo e5ced8ebb1 cgroup_freezer: replace freezer->lock with freezer_mutex
After 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators requires sleepable context as
it may block on css_set_rwsem.  I missed that cgroup_freezer was
iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
errors like the following on freezer state reads and transitions.

  BUG: sleeping function called from invalid context at /work
 /os/work/kernel/locking/rwsem.c:20
  in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
  5 locks held by bash/462:
   #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
   #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
   #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
   #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
   #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
  Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460

  CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
   ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
   0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
  Call Trace:
   [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
   [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
   [<ffffffff81d05974>] down_read+0x24/0x60
   [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
   [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
   [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
   [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
   [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
   [<ffffffff811f121d>] SyS_write+0x4d/0xc0
   [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be IRQ-safe spinlock or even
per-cgroup.  In fact, given the fact that a cgroup may contain large
number of tasks, it's not a good idea to iterate over them while
holding IRQ-safe spinlock.

Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex.  This also makes the comments explaining the
intricacies of policy inheritance and the locking around it as the
states are protected by a common mutex.

The conversion is mostly straight-forward.  The followings are worth
mentioning.

* freezer_css_online() no longer needs double locking.

* freezer_attach() now performs propagation simply while holding
  freezer_mutex.  update_if_frozen() race no longer exists and the
  comment is removed.

* freezer_fork() now tests whether the task is in root cgroup using
  the new task_css_is_root() without doing rcu_read_lock/unlock().  If
  not, it grabs freezer_mutex and performs the operation.

* freezer_read() and freezer_change_state() grab freezer_mutex across
  the whole operation and pin the css while iterating so that each
  descendant processing happens in sleepable context.

Fixes: 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 11:26:31 -04:00
Tejun Heo 5024ae29cd cgroup: introduce task_css_is_root()
Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.

Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.

v2: cgroup no longer supports modular controllers.  No need to export
    init_css_set.  Pointed out by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13 11:26:27 -04:00
Tejun Heo 4fb6e25049 percpu-refcount: implement percpu_ref_tryget()
Implement percpu_ref_tryget() which fails if the refcnt already
reached zero.  Note that this is different from the recently renamed
percpu_ref_tryget_live() which fails if the refcnt has been killed and
is draining the remaining references.  percpu_ref_tryget() succeeds on
a killed refcnt as long as its current refcnt is above zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Kent Overstreet <kmo@daterainc.com>
2014-05-09 15:42:35 -04:00
Tejun Heo 2070d50e1c percpu-refcount: rename percpu_ref_tryget() to percpu_ref_tryget_live()
percpu_ref_tryget() is different from the usual tryget semantics in
that it fails if the refcnt is in its dying stage even if the refcnt
hasn't reached zero yet.  We're about to introduce the more
conventional tryget and the current one has only one user.  Let's
rename it to percpu_ref_tryget_live() so that it explicitly signifies
the peculiarities of its semantics.

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Kent Overstreet <kmo@daterainc.com>
2014-05-09 15:42:15 -04:00
Tejun Heo 2b53f41fa8 cgroup: remove unused CGRP_SANE_BEHAVIOR
This cgroup flag has never been used.  Only CGRP_ROOT_SANE_BEHAVIOR is
used.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-07 09:21:56 -04:00
Fabian Frederick 12d3089c19 kernel/cpuset.c: convert printk to pr_foo()
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-06 07:31:14 -04:00
Fabian Frederick fc34ac1dc5 kernel/cpuset.c: kernel-doc fixes
This patch also converts seq_printf to seq_puts

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-06 07:31:14 -04:00
Fabian Frederick 60106946ca kernel/cgroup.c: fix 2 kernel-doc warnings
Fix typo and variable name.

tj: Updated @cgrp argument description in cgroup_destroy_css_killed()

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-05 14:33:04 -04:00
Tejun Heo 36c38fb714 blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
which nests above both the kernfs s_active protection and cgroup_mutex
is added to synchronize cgroup file type operations as cgroup_mutex
needed to be grabbed from some file operations and thus can't be put
above s_active protection.

While this arrangement mostly worked for cgroup, this triggered the
following lockdep warning.

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
  -------------------------------------------------------
  trinity-c173/9024 is trying to acquire lock:
  (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)

  but task is already holding lock:
  (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (s_active#89){++++.+}:
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
  kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
  cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
  cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
  rebind_subsystems (kernel/cgroup.c:1144)
  cgroup_setup_root (kernel/cgroup.c:1568)
  cgroup_mount (kernel/cgroup.c:1716)
  mount_fs (fs/super.c:1094)
  vfs_kern_mount (fs/namespace.c:899)
  do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
  SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
  tracesys (arch/x86/kernel/entry_64.S:746)

  -> #1 (cgroup_tree_mutex){+.+.+.}:
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
  cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
  blkcg_policy_register (block/blk-cgroup.c:1106)
  throtl_init (block/blk-throttle.c:1694)
  do_one_initcall (init/main.c:789)
  kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
  kernel_init (init/main.c:935)
  ret_from_fork (arch/x86/kernel/entry_64.S:552)

  -> #0 (blkcg_pol_mutex){+.+.+.}:
  __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
  blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
  cgroup_file_write (kernel/cgroup.c:2714)
  kernfs_fop_write (fs/kernfs/file.c:295)
  vfs_write (fs/read_write.c:532)
  SyS_write (fs/read_write.c:584 fs/read_write.c:576)
  tracesys (arch/x86/kernel/entry_64.S:746)

  other info that might help us debug this:

  Chain exists of:
  blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(s_active#89);
				 lock(cgroup_tree_mutex);
				 lock(s_active#89);
    lock(blkcg_pol_mutex);

   *** DEADLOCK ***

  4 locks held by trinity-c173/9024:
  #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
  #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
  #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
  #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

  stack backtrace:
  CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
   ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
   ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
   ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
  Call Trace:
  dump_stack (lib/dump_stack.c:52)
  print_circular_bug (kernel/locking/lockdep.c:1216)
  __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
  blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
  cgroup_file_write (kernel/cgroup.c:2714)
  kernfs_fop_write (fs/kernfs/file.c:295)
  vfs_write (fs/read_write.c:532)
  SyS_write (fs/read_write.c:584 fs/read_write.c:576)

This is a highly unlikely but valid circular dependency between "echo
1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
through further locking update which will remove this complication but
for now let's use trylock on blkcg_pol_mutex and retry the file
operation if the trylock fails.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
2014-05-05 13:48:18 -04:00
Aristeu Rozanski d2c2b11cfa device_cgroup: check if exception removal is allowed
[PATCH v3 1/2] device_cgroup: check if exception removal is allowed

When the device cgroup hierarchy was introduced in
	bd2953ebbb - devcg: propagate local changes down the hierarchy

a specific case was overlooked. Consider the hierarchy bellow:

	A	default policy: ALLOW, exceptions will deny access
	 \
	  B	default policy: ALLOW, exceptions will deny access

There's no need to verify when an new exception is added to B because
in this case exceptions will deny access to further devices, which is
always fine. Hierarchy in device cgroup only makes sure B won't have
more access than A.

But when an exception is removed (by writing devices.allow), it isn't
checked if the user is in fact removing an inherited exception from A,
thus giving more access to B.

Example:

	# echo 'a' >A/devices.allow
	# echo 'c 1:3 rw' >A/devices.deny
	# echo $$ >A/B/tasks
	# echo >/dev/null
	-bash: /dev/null: Operation not permitted
	# echo 'c 1:3 w' >A/B/devices.allow
	# echo >/dev/null
	#

This shouldn't be allowed and this patch fixes it by making sure to never allow
exceptions in this case to be removed if the exception is partially or fully
present on the parent.

v3: missing '*' in function description
v2: improved log message and formatting fixes

Cc: cgroups@vger.kernel.org
Cc: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-05 11:20:12 -04:00
Aristeu Rozanski f5f3cf6f7e device_cgroup: fix the comment format for recently added functions
Moving more extensive explanations to the end of the comment.

Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-04 15:21:09 -04:00
Tejun Heo 15a4c835e4 cgroup, memcg: implement css->id and convert css_from_id() to use it
Until now, cgroup->id has been used to identify all the associated
csses and css_from_id() takes cgroup ID and returns the matching css
by looking up the cgroup and then dereferencing the css associated
with it; however, now that the lifetimes of cgroup and css are
separate, this is incorrect and breaks on the unified hierarchy when a
controller is disabled and enabled back again before the previous
instance is released.

This patch adds css->id which is a subsystem-unique ID and converts
css_from_id() to look up by the new css->id instead.  memcg is the
only user of css_from_id() and also converted to use css->id instead.

For traditional hierarchies, this shouldn't make any functional
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:14 -04:00
Tejun Heo ddfcadab35 cgroup: update init_css() into init_and_link_css()
init_css() takes the cgroup the new css belongs to as an argument and
initializes the new css's ->cgroup and ->parent pointers but doesn't
acquire the matching reference counts.  After the previous patch,
create_css() puts init_css() and reference acquisition right next to
each other.  Let's move reference acquistion into init_css() and
rename the function to init_and_link_css().  This makes sense and is
easier to follow.  This makes the root csses to hold a reference on
cgrp_dfl_root.cgrp, which is harmless.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:14 -04:00
Tejun Heo a2bed8209a cgroup: use RCU free in create_css() failure path
Currently, when create_css() fails in the middle, the half-initialized
css is freed by invoking cgroup_subsys->css_free() directly.  This
patch updates the function so that it invokes RCU free path instead.
As the RCU free path puts the parent css and owning cgroup, their
references are now acquired right after a new css is successfully
allocated.

This doesn't make any visible difference now but is to enable
implementing css->id and RCU protected lookup by such IDs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:14 -04:00
Tejun Heo 6fa4918d03 cgroup: protect cgroup_root->cgroup_idr with a spinlock
Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
ends up requiring cgroup_put() to be invoked under sleepable context.
This is okay for now but is an unusual requirement and we'll soon add
css->id which will have the same problem but won't be able to simply
grab cgroup_mutex as removal will have to happen from css_release()
which can't sleep.

Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
which protects the idr operations with the lock and use them for
cgroup_root->cgroup_idr.  cgroup_put() no longer needs to grab
cgroup_mutex and css_from_id() is updated to always require RCU read
lock instead of either RCU read lock or cgroup_mutex, which doesn't
affect the existing users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:13 -04:00
Tejun Heo 7d699ddb2b cgroup, memcg: allocate cgroup ID from 1
Currently, cgroup->id is allocated from 0, which is always assigned to
the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
invalid IDs and ends up incrementing all IDs by one.

It's reasonable to reserve 0 for special purposes.  This patch updates
cgroup core so that ID 0 is not used and the root cgroups get ID 1.
The ID incrementing is removed form memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:13 -04:00
Tejun Heo 69dfa00ccb cgroup: make flags and subsys_masks unsigned int
There's no reason to use atomic bitops for cgroup_subsys_state->flags,
cgroup_root->flags and various subsys_masks.  This patch updates those
to use bitwise and/or operations instead and converts them form
unsigned long to unsigned int.

This makes the fields occupy (marginally) smaller space and makes it
clear that they don't require atomicity.

This patch doesn't cause any behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:13 -04:00
Joe Perches ed3d261b53 cgroup: Use more current logging style
Use pr_fmt and remove embedded prefixes.
Realign modified multi-line statements to open parenthesis.
Convert embedded function name to "%s: ", __func__

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-25 18:28:03 -04:00
Jianyu Zhan a2a1f9eaf9 cgroup: replace pr_warning with preferred pr_warn
As suggested by scripts/checkpatch.pl, substitude all pr_warning()
with pr_warn().

No functional change.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-25 18:28:03 -04:00
Jianyu Zhan f8719ccf7b cgroup: remove orphaned cgroup_pidlist_seq_operations
6612f05b88 ("cgroup: unify pidlist and other file handling")
has removed the only user of cgroup_pidlist_seq_operations :
cgroup_pidlist_open().

This patch removes it.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-25 18:28:03 -04:00