linux

Commit Graph

Author	SHA1	Message	Date
Tejun Heo	7d331fa985	cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write() After waiting for a child to finish offline, cgroup_subtree_control_write() jumps up to retry from after the input parsing and active protection breaking. This retry makes the scheduled locking update - removal of cgroup_tree_mutex - more difficult. Let's simplify it by returning with restart_syscall() for retries. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:11:00 -04:00
Tejun Heo	d37167ab7b	cgroup: update and fix parsing of "cgroup.subtree_control" I was confused that strsep() was equivalent to strtok_r() in skipping over consecutive delimiters. strsep() just splits at the first occurrence of one of the delimiters which makes the parsing very inflexible, which makes allowing multiple whitespace chars as delimters kinda moot. Let's just be consistently strict and require list of tokens separated by spaces. This is what Documentation/cgroups/unified-hierarchy.txt describes too. Also, parsing may access beyond the end of the string if the string ends with spaces or is zero-length. Make sure it skips zero-length tokens. Note that this also ensures that the parser doesn't puke on multiple consecutive spaces. v2: Add zero-length token skipping. v3: Added missing space after "==". Spotted by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	0ab7a60dea	cgroup: css_release() shouldn't clear cgroup->subsys[] `c1a71504e9` ("cgroup: don't recycle cgroup id until all csses' have been destroyed") made cgroup ID persist until a cgroup is released and add cgroup->subsys[] clearing to css_release() so that css_from_id() doesn't return a css which has already been released which happens before cgroup release; however, the right change here was updating offline_css() to clear cgroup->subsys[] which was done by `e329780310` ("cgroup: cgroup->subsys[] should be cleared after the css is offlined") instead of clearing it from css_release(). We're now clearing cgroup->subsys[] twice. This is okay for traditional hierarchies as a css's lifetime is the same as its cgroup's; however, this confuses unified hierarchy and turning on and off a controller repeatedly using "cgroup.subtree_control" can lead to an oops like the following which happens because cgroup->subsys[] is incorrectly cleared asynchronously by css_release(). BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08 IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0 PGD 1170d067 PUD f0ab067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000 RIP: 0010:[<ffffffff81130c11>] [<ffffffff81130c11>] kill_css+0x21/0x1c0 RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98 RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000 R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001 R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00 FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0 Stack: ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80 ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48 ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002 Call Trace: [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00 [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0 [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0 [<ffffffff811f35ad>] SyS_write+0x4d/0xc0 [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff RIP [<ffffffff81130c11>] kill_css+0x21/0x1c0 RSP <ffff88000e199dc8> CR2: 0000000000000008 ---[ end trace e7aae1f877c4e1b4 ]--- Remove the unnecessary cgroup->subsys[] clearing from css_release(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	54504e977c	cgroup: cgroup_idr_lock should be bh cgroup_idr_remove() can be invoked from bh leading to lockdep detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	0cee8b7786	cgroup: fix offlining child waiting in cgroup_subtree_control_write() cgroup_subtree_control_write() waits for offline to complete child-by-child before enabling a controller; however, it has a couple bugs. * It doesn't initialize the wait_queue_t. This can lead to infinite hang on the following schedule() among other things. * It forgets to pin the child before releasing cgroup_tree_mutex and performing schedule(). The child may already be gone by the time it wakes up and invokes finish_wait(). Pin the child being waited on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	f21a4f7594	Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16 Pull to receive `e37a06f109` ("cgroup: fix the retry path of cgroup_mount()") to avoid unnecessary conflicts with planned cgroup_tree_mutex removal and also to be able to remove the temp fix added by `36c38fb714` ("blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()") afterwards. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-13 11:30:04 -04:00
Tejun Heo	36e9d2ebcc	cgroup: fix rcu_read_lock() leak in update_if_frozen() While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer: replace freezer->lock with freezer_mutex") introduced a bug in update_if_frozen() where it returns with rcu_read_lock() held. Fix it by adding rcu_read_unlock() before returning. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kbuild test robot <fengguang.wu@intel.com>	2014-05-13 11:28:30 -04:00
Tejun Heo	d39ea871c3	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16 Pull to receive percpu_ref_tryget[_live]() changes. Planned cgroup changes will make use of them. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-13 11:27:24 -04:00
Tejun Heo	e5ced8ebb1	cgroup_freezer: replace freezer->lock with freezer_mutex After `96d365e0b8` ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem"), css task iterators requires sleepable context as it may block on css_set_rwsem. I missed that cgroup_freezer was iterating tasks under IRQ-safe spinlock freezer->lock. This leads to errors like the following on freezer state reads and transitions. BUG: sleeping function called from invalid context at /work /os/work/kernel/locking/rwsem.c:20 in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash 5 locks held by bash/462: #0: (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0 #1: (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170 #2: (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170 #3: (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0 #4: (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0 Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460 CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000 ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740 0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246 Call Trace: [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a [<ffffffff810cf4f2>] __might_sleep+0x162/0x260 [<ffffffff81d05974>] down_read+0x24/0x60 [<ffffffff81133e87>] css_task_iter_start+0x27/0x70 [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130 [<ffffffff81135a16>] freezer_write+0xf6/0x1e0 [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230 [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f0756>] vfs_write+0xb6/0x1c0 [<ffffffff811f121d>] SyS_write+0x4d/0xc0 [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b freezer->lock used to be used in hot paths but that time is long gone and there's no reason for the lock to be IRQ-safe spinlock or even per-cgroup. In fact, given the fact that a cgroup may contain large number of tasks, it's not a good idea to iterate over them while holding IRQ-safe spinlock. Let's simplify locking by replacing per-cgroup freezer->lock with global freezer_mutex. This also makes the comments explaining the intricacies of policy inheritance and the locking around it as the states are protected by a common mutex. The conversion is mostly straight-forward. The followings are worth mentioning. * freezer_css_online() no longer needs double locking. * freezer_attach() now performs propagation simply while holding freezer_mutex. update_if_frozen() race no longer exists and the comment is removed. * freezer_fork() now tests whether the task is in root cgroup using the new task_css_is_root() without doing rcu_read_lock/unlock(). If not, it grabs freezer_mutex and performs the operation. * freezer_read() and freezer_change_state() grab freezer_mutex across the whole operation and pin the css while iterating so that each descendant processing happens in sleepable context. Fixes: `96d365e0b8` ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem") Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 11:26:31 -04:00
Tejun Heo	5024ae29cd	cgroup: introduce task_css_is_root() Determining the css of a task usually requires RCU read lock as that's the only thing which keeps the returned css accessible till its reference is acquired; however, testing whether a task belongs to the root can be performed without dereferencing the returned css by comparing the returned pointer against the root one in init_css_set[] which never changes. Implement task_css_is_root() which can be invoked in any context. This will be used by the scheduled cgroup_freezer change. v2: cgroup no longer supports modular controllers. No need to export init_css_set. Pointed out by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 11:26:27 -04:00
Tejun Heo	4fb6e25049	percpu-refcount: implement percpu_ref_tryget() Implement percpu_ref_tryget() which fails if the refcnt already reached zero. Note that this is different from the recently renamed percpu_ref_tryget_live() which fails if the refcnt has been killed and is draining the remaining references. percpu_ref_tryget() succeeds on a killed refcnt as long as its current refcnt is above zero. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Kent Overstreet <kmo@daterainc.com>	2014-05-09 15:42:35 -04:00
Tejun Heo	2070d50e1c	percpu-refcount: rename percpu_ref_tryget() to percpu_ref_tryget_live() percpu_ref_tryget() is different from the usual tryget semantics in that it fails if the refcnt is in its dying stage even if the refcnt hasn't reached zero yet. We're about to introduce the more conventional tryget and the current one has only one user. Let's rename it to percpu_ref_tryget_live() so that it explicitly signifies the peculiarities of its semantics. This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Kent Overstreet <kmo@daterainc.com>	2014-05-09 15:42:15 -04:00
Tejun Heo	2b53f41fa8	cgroup: remove unused CGRP_SANE_BEHAVIOR This cgroup flag has never been used. Only CGRP_ROOT_SANE_BEHAVIOR is used. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-07 09:21:56 -04:00
Fabian Frederick	12d3089c19	kernel/cpuset.c: convert printk to pr_foo() Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-06 07:31:14 -04:00
Fabian Frederick	fc34ac1dc5	kernel/cpuset.c: kernel-doc fixes This patch also converts seq_printf to seq_puts Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-06 07:31:14 -04:00
Fabian Frederick	60106946ca	kernel/cgroup.c: fix 2 kernel-doc warnings Fix typo and variable name. tj: Updated @cgrp argument description in cgroup_destroy_css_killed() Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-05 14:33:04 -04:00
Tejun Heo	36c38fb714	blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() During the recent conversion of cgroup to kernfs, cgroup_tree_mutex which nests above both the kernfs s_active protection and cgroup_mutex is added to synchronize cgroup file type operations as cgroup_mutex needed to be grabbed from some file operations and thus can't be put above s_active protection. While this arrangement mostly worked for cgroup, this triggered the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W ------------------------------------------------------- trinity-c173/9024 is trying to acquire lock: (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) but task is already holding lock: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (s_active#89){++++.+}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024) kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219) cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899) cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2)) rebind_subsystems (kernel/cgroup.c:1144) cgroup_setup_root (kernel/cgroup.c:1568) cgroup_mount (kernel/cgroup.c:1716) mount_fs (fs/super.c:1094) vfs_kern_mount (fs/namespace.c:899) do_mount (fs/namespace.c:2238 fs/namespace.c:2561) SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729) tracesys (arch/x86/kernel/entry_64.S:746) -> #1 (cgroup_tree_mutex){+.+.+.}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040) blkcg_policy_register (block/blk-cgroup.c:1106) throtl_init (block/blk-throttle.c:1694) do_one_initcall (init/main.c:789) kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003) kernel_init (init/main.c:935) ret_from_fork (arch/x86/kernel/entry_64.S:552) -> #0 (blkcg_pol_mutex){+.+.+.}: __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) tracesys (arch/x86/kernel/entry_64.S:746) other info that might help us debug this: Chain exists of: blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#89); lock(cgroup_tree_mutex); lock(s_active#89); lock(blkcg_pol_mutex); * DEADLOCK * 4 locks held by trinity-c173/9024: #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714) #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530) #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283) #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) stack backtrace: CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002 ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004 ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0 Call Trace: dump_stack (lib/dump_stack.c:52) print_circular_bug (kernel/locking/lockdep.c:1216) __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) This is a highly unlikely but valid circular dependency between "echo 1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going through further locking update which will remove this complication but for now let's use trylock on blkcg_pol_mutex and retry the file operation if the trylock fails. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com	2014-05-05 13:48:18 -04:00
Aristeu Rozanski	d2c2b11cfa	device_cgroup: check if exception removal is allowed [PATCH v3 1/2] device_cgroup: check if exception removal is allowed When the device cgroup hierarchy was introduced in `bd2953ebbb` - devcg: propagate local changes down the hierarchy a specific case was overlooked. Consider the hierarchy bellow: A default policy: ALLOW, exceptions will deny access \ B default policy: ALLOW, exceptions will deny access There's no need to verify when an new exception is added to B because in this case exceptions will deny access to further devices, which is always fine. Hierarchy in device cgroup only makes sure B won't have more access than A. But when an exception is removed (by writing devices.allow), it isn't checked if the user is in fact removing an inherited exception from A, thus giving more access to B. Example: # echo 'a' >A/devices.allow # echo 'c 1:3 rw' >A/devices.deny # echo $$ >A/B/tasks # echo >/dev/null -bash: /dev/null: Operation not permitted # echo 'c 1:3 w' >A/B/devices.allow # echo >/dev/null # This shouldn't be allowed and this patch fixes it by making sure to never allow exceptions in this case to be removed if the exception is partially or fully present on the parent. v3: missing '*' in function description v2: improved log message and formatting fixes Cc: cgroups@vger.kernel.org Cc: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-05 11:20:12 -04:00
Aristeu Rozanski	f5f3cf6f7e	device_cgroup: fix the comment format for recently added functions Moving more extensive explanations to the end of the comment. Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-04 15:21:09 -04:00
Tejun Heo	15a4c835e4	cgroup, memcg: implement css->id and convert css_from_id() to use it Until now, cgroup->id has been used to identify all the associated csses and css_from_id() takes cgroup ID and returns the matching css by looking up the cgroup and then dereferencing the css associated with it; however, now that the lifetimes of cgroup and css are separate, this is incorrect and breaks on the unified hierarchy when a controller is disabled and enabled back again before the previous instance is released. This patch adds css->id which is a subsystem-unique ID and converts css_from_id() to look up by the new css->id instead. memcg is the only user of css_from_id() and also converted to use css->id instead. For traditional hierarchies, this shouldn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:14 -04:00
Tejun Heo	ddfcadab35	cgroup: update init_css() into init_and_link_css() init_css() takes the cgroup the new css belongs to as an argument and initializes the new css's ->cgroup and ->parent pointers but doesn't acquire the matching reference counts. After the previous patch, create_css() puts init_css() and reference acquisition right next to each other. Let's move reference acquistion into init_css() and rename the function to init_and_link_css(). This makes sense and is easier to follow. This makes the root csses to hold a reference on cgrp_dfl_root.cgrp, which is harmless. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:14 -04:00
Tejun Heo	a2bed8209a	cgroup: use RCU free in create_css() failure path Currently, when create_css() fails in the middle, the half-initialized css is freed by invoking cgroup_subsys->css_free() directly. This patch updates the function so that it invokes RCU free path instead. As the RCU free path puts the parent css and owning cgroup, their references are now acquired right after a new css is successfully allocated. This doesn't make any visible difference now but is to enable implementing css->id and RCU protected lookup by such IDs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:14 -04:00
Tejun Heo	6fa4918d03	cgroup: protect cgroup_root->cgroup_idr with a spinlock Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which ends up requiring cgroup_put() to be invoked under sleepable context. This is okay for now but is an unusual requirement and we'll soon add css->id which will have the same problem but won't be able to simply grab cgroup_mutex as removal will have to happen from css_release() which can't sleep. Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers which protects the idr operations with the lock and use them for cgroup_root->cgroup_idr. cgroup_put() no longer needs to grab cgroup_mutex and css_from_id() is updated to always require RCU read lock instead of either RCU read lock or cgroup_mutex, which doesn't affect the existing users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:13 -04:00
Tejun Heo	7d699ddb2b	cgroup, memcg: allocate cgroup ID from 1 Currently, cgroup->id is allocated from 0, which is always assigned to the root cgroup; unfortunately, memcg wants to use ID 0 to indicate invalid IDs and ends up incrementing all IDs by one. It's reasonable to reserve 0 for special purposes. This patch updates cgroup core so that ID 0 is not used and the root cgroups get ID 1. The ID incrementing is removed form memcg. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:13 -04:00
Tejun Heo	69dfa00ccb	cgroup: make flags and subsys_masks unsigned int There's no reason to use atomic bitops for cgroup_subsys_state->flags, cgroup_root->flags and various subsys_masks. This patch updates those to use bitwise and/or operations instead and converts them form unsigned long to unsigned int. This makes the fields occupy (marginally) smaller space and makes it clear that they don't require atomicity. This patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:13 -04:00
Joe Perches	ed3d261b53	cgroup: Use more current logging style Use pr_fmt and remove embedded prefixes. Realign modified multi-line statements to open parenthesis. Convert embedded function name to "%s: ", __func__ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Jianyu Zhan	a2a1f9eaf9	cgroup: replace pr_warning with preferred pr_warn As suggested by scripts/checkpatch.pl, substitude all pr_warning() with pr_warn(). No functional change. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Jianyu Zhan	f8719ccf7b	cgroup: remove orphaned cgroup_pidlist_seq_operations `6612f05b88` ("cgroup: unify pidlist and other file handling") has removed the only user of cgroup_pidlist_seq_operations : cgroup_pidlist_open(). This patch removes it. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Jianyu Zhan	2f0edc04e7	cgroup: clean up obsolete comment for parse_cgroupfs_options() `1d5be6b287` ("cgroup: move module ref handling into rebind_subsystems()") makes parse_cgroupfs_options() no longer takes refcounts on subsystems. And unified hierachy makes parse_cgroupfs_options not need to call with cgroup_mutex held to protect the cgroup_subsys[]. So this patch removes BUG_ON() and the comment. As the comment doesn't contain useful information afterwards, the whole comment is removed. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Tejun Heo	6573157800	cgroup: add documentation about unified hierarchy Unified hierarchy will be the new version of cgroup interface. This patch adds Documentation/cgroups/unified-hierarchy.txt which describes the design and rationales of unified hierarchy. v2: Grammatical updates as per Randy Dunlap's review. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org>	2014-04-25 18:28:02 -04:00
Tejun Heo	842b597ee0	cgroup: implement cgroup.populated for the default hierarchy cgroup users often need a way to determine when a cgroup's subhierarchy becomes empty so that it can be cleaned up. cgroup currently provides release_agent for it; unfortunately, this mechanism is riddled with issues. * It delivers events by forking and execing a userland binary specified as the release_agent. This is a long deprecated method of notification delivery. It's extremely heavy, slow and cumbersome to integrate with larger infrastructure. * There is single monitoring point at the root. There's no way to delegate management of a subtree. * The event isn't recursive. It triggers when a cgroup doesn't have any tasks or child cgroups. Events for internal nodes trigger only after all children are removed. This again makes it impossible to delegate management of a subtree. * Events are filtered from the kernel side. "notify_on_release" file is used to subscribe to or suppress release event. This is unnecessarily complicated and probably done this way because event delivery itself was expensive. This patch implements interface file "cgroup.populated" which can be used to monitor whether the cgroup's subhierarchy has tasks in it or not. Its value is 0 if there is no task in the cgroup and its descendants; otherwise, 1, and kernfs_notify() notificaiton is triggers when the value changes, which can be monitored through poll and [di]notify. This is a lot ligther and simpler and trivially allows delegating management of subhierarchy - subhierarchy monitoring can block further propgation simply by putting itself or another process in the root of the subhierarchy and monitor events that it's interested in from there without interfering with monitoring higher in the tree. v2: Patch description updated as per Serge. v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The subtree_ prefix was a bit confusing because "cgroup.subtree_control" uses it to denote the tree rooted at the cgroup sans the cgroup itself while the populated state includes the cgroup itself. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Lennart Poettering <lennart@poettering.net>	2014-04-25 18:28:02 -04:00
Tejun Heo	50bce01b0e	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into for-3.16 Pull in driver-core-next to receive kernfs_notify() updates which will be used by the planned "cgroup.populated" implementation. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:25:55 -04:00
Michael Marineau	86d56134f1	kobject: Make support for uevent_helper optional. Support for uevent_helper, aka hotplug, is not required on many systems these days but it can still be enabled via sysfs or sysctl. Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com> Signed-off-by: Michael Marineau <mike@marineau.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-04-25 12:00:49 -07:00
Tejun Heo	d911d98748	kernfs: make kernfs_notify() trigger inotify events too kernfs_notify() is used to indicate either new data is available or the content of a file has changed. It currently only triggers poll which may not be the most convenient to monitor especially when there are a lot to monitor. Let's hook it up to fsnotify too so that the events can be monitored via inotify too. fsnotify_modify() requires file * but kernfs_notify() doesn't have any specific file associated; however, we can walk all super_blocks associated with a kernfs_root and as kernfs always associate one ino with inode and one dentry with an inode, it's trivial to look up the dentry associated with a given kernfs_node. As any active monitor would pin dentry, just looking up existing dentry is enough. This patch looks up the dentry associated with the specified kernfs_node and generates events equivalent to fsnotify_modify(). Note that as fsnotify doesn't provide fsnotify_modify() equivalent which can be called with dentry, kernfs_notify() directly calls fsnotify_parent() and fsnotify(). It might be better to add a wrapper in fsnotify.h instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-04-25 11:43:31 -07:00
Tejun Heo	7d568a8383	kernfs: implement kernfs_root->supers list Currently, there's no way to find out which super_blocks are associated with a given kernfs_root. Let's implement it - the planned inotify extension to kernfs_notify() needs it. Make kernfs_super_info point back to the super_block and chain it at kernfs_root->supers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-04-25 11:43:31 -07:00
Tejun Heo	f8f22e53a2	cgroup: implement dynamic subtree controller enable/disable on the default hierarchy cgroup is switching away from multiple hierarchies and will use one unified default hierarchy where controllers can be dynamically enabled and disabled per subtree. The default hierarchy will serve as the unified hierarchy to which all controllers are attached and a css on the default hierarchy would need to also serve the tasks of descendant cgroups which don't have the controller enabled - ie. the tree may be collapsed from leaf towards root when viewed from specific controllers. This has been implemented through effective css in the previous patches. This patch finally implements dynamic subtree controller enable/disable on the default hierarchy via a new knob - "cgroup.subtree_control" which controls which controllers are enabled on the child cgroups. Let's assume a hierarchy like the following. root - A - B - C \ D root's "cgroup.subtree_control" determines which controllers are enabled on A. A's on B. B's on C and D. This coincides with the fact that controllers on the immediate sub-level are used to distribute the resources of the parent. In fact, it's natural to assume that resource control knobs of a child belong to its parent. Enabling a controller in "cgroup.subtree_control" declares that distribution of the respective resources of the cgroup will be controlled. Note that this means that controller enable states are shared among siblings. The default hierarchy has an extra restriction - only cgroups which don't contain any task may have controllers enabled in "cgroup.subtree_control". Combined with the other properties of the default hierarchy, this guarantees that, from the view point of controllers, tasks are only on the leaf cgroups. In other words, only leaf csses may contain tasks. This rules out situations where child cgroups compete against internal tasks of the parent, which is a competition between two different types of entities without any clear way to determine resource distribution between the two. Different controllers handle it differently and all the implemented behaviors are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this structural constraints imposed from cgroup core removes the burden from controller implementations and enables showing one consistent behavior across all controllers. When a controller is enabled or disabled, css associations for the controller in the subtrees of each child should be updated. After enabling, the whole subtree of a child should point to the new css of the child. After disabling, the whole subtree of a child should point to the cgroup's css. This is implemented by first updating cgroup states such that cgroup_e_css() result points to the appropriate css and then invoking cgroup_update_dfl_csses() which migrates all tasks in the affected subtrees to the self cgroup on the default hierarchy. * When read, "cgroup.subtree_control" lists all the currently enabled controllers on the children of the cgroup. * White-space separated list of controller names prefixed with either '+' or '-' can be written to "cgroup.subtree_control". The ones prefixed with '+' are enabled on the controller and '-' disabled. * A controller can be enabled iff the parent's "cgroup.subtree_control" enables it and disabled iff no child's "cgroup.subtree_control" has it enabled. * If a cgroup has tasks, no controller can be enabled via "cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has some controllers enabled, tasks can't be migrated into the cgroup. * All controllers which aren't bound on other hierarchies are automatically associated with the root cgroup of the default hierarchy. All the controllers which are bound to the default hierarchy are listed in the read-only file "cgroup.controllers" in the root directory. * "cgroup.controllers" in all non-root cgroups is read-only file whose content is equal to that of "cgroup.subtree_control" of the parent. This indicates which controllers can be used in the cgroup's "cgroup.subtree_control". This is still experimental and there are some holes, one of which is that ->can_attach() failure during cgroup_update_dfl_csses() may leave the cgroups in an undefined state. The issues will be addressed by future patches. v2: Non-root cgroups now also have "cgroup.controllers". Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	f817de9851	cgroup: prepare migration path for unified hierarchy Unified hierarchy implementation would require re-migrating tasks onto the same cgroup on the default hierarchy to reflect updated effective csses. Update cgroup_migrate_prepare_dst() so that it accepts NULL as the destination cgrp. When NULL is specified, the destination is considered to be the cgroup on the default hierarchy associated with each css_set. After this change, the identity check in cgroup_migrate_add_src() isn't sufficient for noop detection as the associated csses may change without any cgroup association changing. The only way to tell whether a migration is noop or not is testing whether the source and destination csets are identical. The noop check in cgroup_migrate_add_src() is removed and cset identity test is added to cgroup_migreate_prepare_dst(). If it's detected that source and destination csets are identical, the cset is removed removed from @preloaded_csets and all the migration nodes are cleared which makes cgroup_migrate() ignore the cset. Also, make the function append the destination css_sets to @preloaded_list so that destination css_sets always come after source css_sets. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	7fd8c565d8	cgroup: update subsystem rebind restrictions Because the default root couldn't have any non-root csses attached to it, rebinding away from it was always allowed; however, the default hierarchy will soon host the unified hierarchy and have non-root csses so the rebind restrictions need to be updated accordingly. Instead of special casing rebinding from the default hierarchy and then checking whether the source hierarchy has children cgroups, which implies non-root csses for !dfl hierarchies, simply check whether the source hierarchy has non-root csses for the subsystem using css_next_child(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	6803c00628	cgroup: add css_set->dfl_cgrp To implement the unified hierarchy behavior, we'll need to be able to determine the associated cgroup on the default hierarchy from css_set. Let's add css_set->dfl_cgrp so that it can be accessed conveniently and efficiently. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	bd53d617b3	cgroup: allow cgroup creation and suppress automatic css creation in the unified hierarchy Now that effective css handling has been added and iterators updated accordingly, it's safe to allow cgroup creation in the default hierarchy. Unblock cgroup creation in the default hierarchy. As the default hierarchy will implement explicit enabling and disabling of controllers on each cgroup, suppress automatic css enabling on cgroup creation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	e329780310	cgroup: cgroup->subsys[] should be cleared after the css is offlined After a css finishes offlining, offline_css() mistakenly performs RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the cgroup->subsys[] pointer to the current value. The intention was to clear it after offline is complete, not reassign the same value. Update it to assign NULL instead of the current value. This makes cgroup_css() to return NULL once offline is complete. All the existing users of the function either can handle NULL return already or guarantee that the css doesn't get offlined. While this is a bugfix, as css lifetime is currently tied to the cgroup it belongs to, this bug doesn't cause any actual problems. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	3ebb2b6ef3	cgroup: teach css_task_iter about effective csses Currently, css_task_iter iterates tasks associated with a css by visiting each css_set associated with the owning cgroup and walking tasks of each of them. This works fine for !unified hierarchies as each cgroup has its own css for each associated subsystem on the hierarchy; however, on the planned unified hierarchy, a cgroup may not have csses associated and its tasks would be considered associated with the matching css of the nearest ancestor which has the subsystem enabled. This means that on the default unified hierarchy, just walking all tasks associated with a cgroup isn't enough to walk all tasks which are associated with the specified css. If any of its children doesn't have the matching css enabled, task iteration should also include all tasks from the subtree. We already added cgroup->e_csets[] to list all css_sets effectively associated with a given css and walk css_sets on that list instead to achieve such iteration. This patch updates css_task_iter iteration such that it walks css_sets on cgroup->e_csets[] instead of cgroup->cset_links if iteration is requested on an non-dummy css. Thanks to the previous iteration update, this change can be achieved with the addition of css_task_iter->ss and minimal updates to css_advance_task_iter() and css_task_iter_start(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	0f0a2b4fa6	cgroup: reorganize css_task_iter This patch reorganizes css_task_iter so that adding effective css support is easier. * s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency * ->origin_css is used to determine whether the iteration reached the last css_set. Replace it with explicit ->cset_head so that css_advance_task_iter() doesn't have to know the termination condition directly. * css_task_iter_next() currently assumes that it's walking list of cgrp_cset_link and reaches into the current cset through the current link to determine the termination conditions for task walking. As this won't always be true for effective css walking, add ->tasks_head and ->mg_tasks_head and use them to control task walking so that css_task_iter_next() doesn't have to know how css_sets are being walked. This patch doesn't make any behavior changes. The iteration logic stays unchanged after the patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	3b281afbc3	cgroup: make css_next_child() skip missing csses css_next_child() walks the children of the specified css. It does this by finding the next cgroup and then returning the requested css. On the default unified hierarchy, a cgroup may not have a css associated with it even if the hierarchy has the subsystem enabled. This patch updates css_next_child() so that it skips children without the requested css associated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	2d8f243a5e	cgroup: implement cgroup->e_csets[] On the default unified hierarchy, a cgroup may be associated with csses of its ancestors, which means that a css of a given cgroup may be associated with css_sets of descendant cgroups. This means that we can't walk all tasks associated with a css by iterating the css_sets associated with the cgroup as there are css_sets which are pointing to the css but linked on the descendants. This patch adds per-subsystem list heads cgroup->e_csets[]. Any css_set which is pointing to a css is linked to css->cgroup->e_csets[$SUBSYS_ID] through css_set->e_cset_node[$SUBSYS_ID]. The lists are protected by css_set_rwsem and will allow us to walk all css_sets associated with a given css so that we can find out all associated tasks. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	aec3dfcb2e	cgroup: introduce effective cgroup_subsys_state In the planned default unified hierarchy, controllers may get dynamically attached to and detached from a cgroup and a cgroup may not have csses for all the controllers associated with the hierarchy. When a cgroup doesn't have its own css for a given controller, the css of the nearest ancestor with the controller enabled will be used, which is called the effective css. This patch introduces cgroup_e_css() and for_each_e_css() to access the effective csses and convert compare_css_sets(), find_existing_css_set() and cgroup_migrate() to use the effective csses so that they can handle cgroups with partial csses correctly. This means that for two css_sets to be considered identical, they should have both matching csses and cgroups. compare_css_sets() already compares both, not for correctness but for optimization. As this now becomes a matter of correctness, update the comments accordingly. For all !default hierarchies, cgroup_e_css() always equals cgroup_css(), so this patch doesn't change behavior. While at it, fix incorrect locking comment for for_each_css(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:14 -04:00
Tejun Heo	f392e51cd6	cgroup: update cgroup->subsys_mask to ->child_subsys_mask and restore cgroup_root->subsys_mask `944196278d` ("cgroup: move ->subsys_mask from cgroupfs_root to cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for the unified hierarhcy; however, it turns out that carrying the subsys_mask of the children in the parent, instead of itself, is a lot more natural. This patch restores cgroup_root->subsys_mask and morphs cgroup->subsys_mask into cgroup->child_subsys_mask. * Uses of root->cgrp.subsys_mask are restored to root->subsys_mask. * Remove automatic setting and clearing of cgrp->subsys_mask and instead just inherit ->child_subsys_mask from the parent during cgroup creation. Note that this doesn't affect any current behaviors. * Undo __kill_css() separation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:14 -04:00
Tejun Heo	ea8fd3b47f	cgroup: cgroup_apply_cftypes() shouldn't skip the default hierarhcy cgroup_apply_cftypes() skip creating or removing files if the subsystem is attached to the default hierarchy, which led to missing files in the root of the default hierarchy. Skipping made sense when the default hierarchy was dummy; however, now that the default hierarchy is full functional and planned to be used as the unified hierarchy, it shouldn't be skipped over. Reported-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:14 -04:00
Aristeu Rozanski	79d719749d	device_cgroup: rework device access check and exception checking Whenever a device file is opened and checked against current device cgroup rules, it uses the same function (may_access()) as when a new exception rule is added by writing devices.{allow,deny}. And in both cases, the algorithm is the same, doesn't matter the behavior. First problem is having device access to be considered the same as rule checking. Consider the following structure: A (default behavior: allow, exceptions disallow access) \ B (default behavior: allow, exceptions disallow access) A new exception is added to B by writing devices.deny: c 12:34 rw When checking if that exception is allowed in may_access(): if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) { if (behavior == DEVCG_DEFAULT_ALLOW) { /* the exception will deny access to certain devices / return true; Which is ok, since B is not getting more privileges than A, it doesn't matter and the rule is accepted Now, consider it's a device file open check and the process belongs to cgroup B. The access will be generated as: behavior: allow exception: c 12:34 rw The very same chunk of code will allow it, even if there's an explicit exception telling to do otherwise. A simple test case: # mkdir new_group # cd new_group # echo $$ >tasks # echo "c 1:3 w" >devices.deny # echo >/dev/null # echo $? 0 This is a serious bug and was introduced on `c39a2a3018` devcg: prepare may_access() for hierarchy support To solve this problem, the device file open function was split from the new exception check. Second problem is how exceptions are processed by may_access(). The first part of the said function tries to match fully with an existing exception: list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) { if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK)) continue; if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR)) continue; if (ex->major != ~0 && ex->major != refex->major) continue; if (ex->minor != ~0 && ex->minor != refex->minor) continue; if (refex->access & (~ex->access)) continue; match = true; break; } That means the new exception should be contained into an existing one to be considered a match: New exception Existing match? notes b 12:34 rwm b 12:34 rwm yes b 12:34 r b :34 rw yes b 12:34 rw b 12:34 w no extra "r" b :34 rw b 12:34 rw no too broad "" b :34 rw b :34 rwm yes Which is fine in some cases. Consider: A (default behavior: deny, exceptions allow access) \ B (default behavior: deny, exceptions allow access) In this case the full match makes sense, the new exception cannot add more access than the parent allows But this doesn't always work, consider: A (default behavior: allow, exceptions disallow access) \ B (default behavior: deny, exceptions allow access) In this case, a new exception in B shouldn't match any of the exceptions in A, after all you can't allow something that was forbidden by A. But consider this scenario: New exception Existing in A match? outcome b 12:34 rw b 12:34 r no exception is accepted Because the new exception has "w" as extra, it doesn't match, so it'll be added to B's exception list. The same problem can happen during a file access check. Consider a cgroup with allow as default behavior: Access Exception match? b 12:34 rw b 12:34 r no In this case, the access didn't match any of the exceptions in the cgroup, which is required since exceptions will disallow access. To solve this problem, two new functions were created to match an exception either fully or partially. In the example above, a partial check will be performed and it'll produce a match since at least "b 12:34 r" from "b 12:34 rw" access matches. Cc: cgroups@vger.kernel.org Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-21 18:21:42 -04:00
Linus Torvalds	a798c10faf	Linux 3.15-rc2	2014-04-20 11:08:50 -07:00

1 2 3 4 5 ...

441262 Commits All Branches Search

441262 Commits

All Branches