linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	20b4fb4852	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...	2013-05-01 17:51:54 -07:00
Al Viro	8d8b97ba49	take cgroup_open() and cpuset_open() to fs/proc/base.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-01 17:29:46 -04:00
Linus Torvalds	16fa94b532	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this development cycle were: - full dynticks preparatory work by Frederic Weisbecker - factor out the cpu time accounting code better, by Li Zefan - multi-CPU load balancer cleanups and improvements by Joonsoo Kim - various smaller fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) sched: Fix init NOHZ_IDLE flag sched: Prevent to re-select dst-cpu in load_balance() sched: Rename load_balance_tmpmask to load_balance_mask sched: Move up affinity check to mitigate useless redoing overhead sched: Don't consider other cpus in our group in case of NEWLY_IDLE sched: Explicitly cpu_idle_type checking in rebalance_domains() sched: Change position of resched_cpu() in load_balance() sched: Fix wrong rq's runnable_avg update with rt tasks sched: Document task_struct::personality field sched/cpuacct/UML: Fix header file dependency bug on the UML build cgroup: Kill subsys.active flag sched/cpuacct: No need to check subsys active state sched/cpuacct: Initialize cpuacct subsystem earlier sched/cpuacct: Initialize root cpuacct earlier sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically sched/cpuacct: Clean up cpuacct.h sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field() sched/cpuacct: Remove redundant NULL checks in cpuacct_charge() sched/cpuacct: Add cpuacct_acount_field() sched/cpuacct: Add cpuacct_init() ...	2013-04-30 07:43:28 -07:00
Linus Torvalds	191a712090	Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controllers which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many breakages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. 
Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by the header file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...	2013-04-29 19:14:20 -07:00
Linus Torvalds	46d9be3e5e	Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue updates from Tejun Heo: "A lot of activities on workqueue side this time. The changes achieve the following. - WQ_UNBOUND workqueues - the workqueues which are not per-cpu - are updated to be able to interface with multiple backend worker pools. This involved a lot of churning but the end result seems actually neater as unbound workqueues are now a lot closer to per-cpu ones. - The ability to interface with multiple backend worker pools is used to implement unbound workqueues with custom attributes. Currently the supported attributes are the nice level and CPU affinity. It may be expanded to include cgroup association in future. The attributes can be specified either by calling apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if the workqueue in question is exported through sysfs. The backend worker pools are keyed by the actual attributes and shared by any workqueues which share the same attributes. When attributes of a workqueue are changed, the workqueue binds to the worker pool with the specified attributes while leaving the work items which are already executing in its previous worker pools alone. This allows converting custom worker pool implementations which want worker attribute tuning to use workqueues. The writeback pool is already converted in block tree and a couple of others are likely to follow including btrfs io workers. - WQ_UNBOUND's ability to bind to multiple worker pools is also used to make it NUMA-aware. Because there's no association between work item issuer and the specific worker assigned to execute it, before this change, using unbound workqueue led to unnecessary cross-node bouncing and it couldn't be helped by autonuma as it requires tasks to have implicit node affinity and workers are assigned randomly. 
After these changes, an unbound workqueue now binds to multiple NUMA-affine worker pools so that queued work items are executed in the same node. This is turned on by default but can be disabled system-wide or for individual workqueues. Crypto was requesting NUMA affinity as encrypting data across different nodes can contribute noticeable overhead and doing it per-cpu was too limiting for certain cases and IO throughput could be bottlenecked by one CPU being fully occupied while others have idle cycles. While the new features required a lot of changes including restructuring locking, it didn't complicate the execution paths much. The unbound workqueue handling is now closer to per-cpu ones and the new features are implemented by simply associating a workqueue with different sets of backend worker pools without changing queue, execution or flush paths. As such, even though the amount of change is very high, I feel relatively safe in that it isn't likely to cause subtle issues with basic correctness of work item execution and handling. If something is wrong, it's likely to show up as being associated with worker pools with the wrong attributes or OOPS while workqueue attributes are being changed or during CPU hotplug. While this creates more backend worker pools, it doesn't add too many more workers unless, of course, there are many workqueues with unique combinations of attributes. Assuming everything else is the same, NUMA awareness costs an extra worker pool per NUMA node with online CPUs. There are also a couple things which are being routed outside the workqueue tree. - block tree pulled in workqueue for-3.10 so that writeback worker pool can be converted to unbound workqueue with sysfs control exposed. This simplifies the code, makes writeback workers NUMA-aware and allows tuning nice level and CPU affinity via sysfs. 
- The conversion to workqueue means that there's no 1:1 association between a specific worker, which makes writeback folks unhappy as they want to be able to tell which filesystem caused a problem from backtrace on systems with many filesystems mounted. This is resolved by allowing work items to set debug info string which is printed when the task is dumped. As this change involves unifying implementations of dump_stack() and friends in arch codes, it's being routed through Andrew's -mm tree." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits) workqueue: use kmem_cache_free() instead of kfree() workqueue: avoid false negative WARN_ON() in destroy_workqueue() workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity workqueue: implement NUMA affinity for unbound workqueues workqueue: introduce put_pwq_unlocked() workqueue: introduce numa_pwq_tbl_install() workqueue: use NUMA-aware allocation for pool_workqueues workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() workqueue: map an unbound workqueues to multiple per-node pool_workqueues workqueue: move hot fields of workqueue_struct to the end workqueue: make workqueue->name[] fixed len workqueue: add workqueue->unbound_attrs workqueue: determine NUMA node of workers according to the allowed cpumask workqueue: drop 'H' from kworker names of unbound worker pools workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] workqueue: move pwq_pool_locking outside of get/put_unbound_pool() workqueue: fix memory leak in apply_workqueue_attrs() workqueue: fix unbound workqueue attrs hashing / comparison workqueue: fix race condition in unbound workqueue free path workqueue: remove pwq_lock which is no longer used ...	2013-04-29 19:07:40 -07:00
Michal Hocko	6d2488f64a	cgroup: remove css_get_next Now that we have generic and well ordered cgroup tree walkers there is no need to keep css_get_next in the place. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:33 -07:00
Li Zefan	7ef70e4873	cgroup: restore the call to eventfd->poll() I mistakenly removed the call to eventfd->poll() while I was actually intending to remove the return value... Calling eventfd->poll() will hook cgroup_event_wake() to the poll waitqueue, which will be called to unregister eventfd when rmdir a cgroup or close eventfd. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-26 11:58:03 -07:00
Li Zefan	cc20e01cd6	cgroup: fix use-after-free when umounting cgroupfs Try: # mount -t cgroup xxx /cgroup # mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup And you might see this: ida_remove called for id=1 which is not allocated. It's because cgroup_kill_sb() is called to destroy root->cgroup_ida and free cgrp->root before ida_simple_removed() is called. What's worse is we're accessing cgrp->root while it has been freed. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-26 11:58:02 -07:00
Li Zefan	712317ad97	cgroup: fix broken file xattrs We should store file xattrs in struct cfent instead of struct cftype, because cftype is a type while cfent is an object instance of cftype. For example each cgroup has a tasks file, and each tasks file is associated with a unique cfent, but all those files share the same struct cftype. Alexey Kodanev reported a crash, which can be reproduced: # mount -t cgroup -o xattr /sys/fs/cgroup # mkdir /sys/fs/cgroup/test # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks # rmdir /sys/fs/cgroup/test # umount /sys/fs/cgroup oops! In this case, simple_xattrs_free() will free the same struct simple_xattrs twice. tj: Dropped unused local variable @cft from cgroup_diput(). Cc: <stable@vger.kernel.org> # 3.8.x Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-18 23:11:40 -07:00
Li Zefan	05fb22ec54	cgroup: remove cgrp->top_cgroup It's not used, and it can be retrieved via cgrp->root->top_cgroup. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-14 23:26:10 -07:00
Tejun Heo	873fe09ea5	cgroup: introduce sane_behavior mount option It's a sad fact that at this point various cgroup controllers are carrying so many idiosyncrasies and pure insanities that it simply isn't possible to reach any sort of sane consistent behavior while staying fully compatible with what already has been exposed to userland. As we can't break exposed userland interface, transitioning to sane behaviors can only be done in steps while maintaining backwards compatibility. This patch introduces a new mount option - __DEVEL__sane_behavior - which disables crazy features and enforces consistent behaviors in cgroup core proper and various controllers. As exactly which behaviors it changes are still being determined, the mount option, at this point, is useful only for development of the new behaviors. As such, the mount option is prefixed with __DEVEL__ and generates a warning message when used. Eventually, once we get to the point where all controllers' behaviors are consistent enough to implement unified hierarchy, the __DEVEL__ prefix will be dropped, and more importantly, unified-hierarchy will enforce sane_behavior by default. Maybe we'll be able to completely drop the crazy stuff after a while, maybe not, but we at least have a strategy to move on to saner behaviors. This patch introduces the mount option and changes the following behaviors in cgroup core. * Mount options "noprefix" and "clone_children" are disallowed. Also, cgroupfs file cgroup.clone_children is not created. * When mounting an existing superblock, mount options should match. This is currently pretty crazy. If one mounts a cgroup, creates a subdirectory, unmounts it and then mounts it again with a different option, it looks like the new options are applied but they aren't. * Remount is disallowed. The behavior changes are documented in the comment above CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different controllers are converted and planned improvements progress. 
v2: Dropped unnecessary explicit file permission setting sane_behavior cftype entry as suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com>	2013-04-14 20:15:26 -07:00
Tejun Heo	25a7e6848d	move cgroupfs_root to include/linux/cgroup.h While controllers shouldn't be accessing cgroupfs_root directly, it being hidden inside kernel/cgroup.c makes some things pretty silly. This makes routing hierarchy-wide settings which need to be visible to controllers cumbersome. We're gonna add another hierarchy-wide setting which needs to be accessed from controllers. Move cgroupfs_root and its flags to the header file so that we can access root settings with inline helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-14 20:15:25 -07:00
Tejun Heo	9343862945	cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix There's no reason to be using bitops, which tends to be more cumbersome, to handle root flags. Convert them to masks. Also, as they'll be moved to include/linux/cgroup.h and it's generally a good idea, add CGRP_ prefix. Note that flags are assigned from (1 << 1). The first bit will be used by a flag which will be added soon. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-14 20:15:25 -07:00
Tejun Heo	da1f296fd2	cgroup: make cgroup_path() not print double slashes While reimplementing cgroup_path(), `65dff759d2` ("cgroup: fix cgroup_path() vs rename() race") introduced a bug where the path of a non-root cgroup would have two slashes at the beginning, which is caused by treating the root cgroup which has the name '/' like non-root cgroups. $ grep systemd /proc/self/cgroup 1:name=systemd://user/root/1 Fix it by special casing root cgroup case and not looping over it in the normal path. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com>	2013-04-14 10:47:02 -07:00
Tejun Heo	26d5bbe5ba	Revert "cgroup: remove bind() method from cgroup_subsys." This reverts commit `84cfb6ab48`. There are scheduled changes which make use of the removed callback. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rami Rosen <ramirose@gmail.com> Cc: Li Zefan <lizefan@huawei.com>	2013-04-12 10:29:04 -07:00
Li Zefan	78574cf981	cgroup: implement cgroup_is_descendant() A couple controllers want to determine whether two cgroups are in ancestor/descendant relationship. As it's more likely that the descendant is the primary subject of interest and there are other operations focusing on the descendants, let's ask is_descendent rather than is_ancestor. Implementation is trivial as the previous patch guarantees that all ancestors of a cgroup stay accessible as long as the cgroup is accessible. tj: Removed depth optimization, renamed from cgroup_is_ancestor(), rewrote descriptions. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-10 11:07:08 -07:00
Li Zefan	415cf07a1c	cgroup: make sure parent won't be destroyed before its children Suppose we rmdir a cgroup and there're still css refs, this cgroup won't be freed. Then we rmdir the parent cgroup, and the parent is freed immediately due to css ref draining to 0. Now it would be a disaster if the still-alive child cgroup tries to access its parent. Make sure this won't happen. Signed-off-by: Li Zefan <lizefan@huawei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-10 11:07:00 -07:00
Rami Rosen	84cfb6ab48	cgroup: remove bind() method from cgroup_subsys. The bind() method of cgroup_subsys is not used in any of the controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio, devices, perf, hugetlb, cpu and cpuacct) tj: Removed the entry on ->bind() from Documentation/cgroups/cgroups.txt. Also updated a couple paragraphs which were suggesting that dynamic re-binding may be implemented. It's not gonna. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-10 10:46:59 -07:00
Li Zefan	479f614110	cgroup: Kill subsys.active flag The only user was cpuacct. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5155385A.4040207@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 13:54:22 +02:00
Tejun Heo	2219449a65	cgroup: remove cgroup_lock_is_held() We don't want controllers to assume that the information is officially available and do funky things with it. The only user is task_subsys_state_check() which uses it to verify RCU access context. We can move cgroup_lock_is_held() inside CONFIG_PROVE_RCU but that doesn't add meaningful protection compared to conditionally exposing cgroup_mutex. Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU and use lockdep_is_held() directly on the mutex in task_subsys_state_check(). While at it, add parentheses around macro arguments in task_subsys_state_check(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-07 09:29:51 -07:00
Tejun Heo	47cfcd0922	cgroup: kill cgroup_[un]lock() Now that locking interface is unexported, there's no reason to keep around these thin wrappers. Kill them and use mutex operations directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-07 09:29:51 -07:00
Tejun Heo	b9777cf8d7	cgroup: unexport locking interface and cgroup_attach_task() Now that all external cgroup_lock() users are gone, we can finally unexport the locking interface and prevent future abuse of cgroup_mutex. Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also, cgroup_attach_task() doesn't have any user left and can't be used without locking interface anyway. Make it static too. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-07 09:29:51 -07:00
Tejun Heo	7ae1bad99e	cgroup: relocate cgroup_lock_live_group() and cgroup_attach_task_all() cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be made static. Relocate the former and cgroup_attach_task_all() so that we don't need forward declarations. This patch is pure relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-07 09:29:51 -07:00
Tejun Heo	8cc9934520	cgroup, cpuset: replace move_member_tasks_to_cpuset() with cgroup_transfer_tasks() When a cpuset becomes empty (no CPU or memory), its tasks are transferred to the nearest ancestor with execution resources. This is implemented using cgroup_scan_tasks() with a callback which grabs cgroup_mutex and invokes cgroup_attach_task() on each task. Both cgroup_mutex and cgroup_attach_task() are scheduled to be unexported. Implement cgroup_transfer_tasks() in cgroup proper which is essentially the same as move_member_tasks_to_cpuset() except that it takes cgroups instead of cpusets and @to comes before @from like normal functions with those arguments, and replace move_member_tasks_to_cpuset() with it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-04-07 09:29:50 -07:00
Kevin Wilson	1e2ccd1c0f	cgroup: remove unused parameter in cgroup_task_migrate(). This patch removes unused parameter from cgroup_task_migrate(). Signed-off-by: Kevin Wilson <wkevils@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-03 14:04:33 -07:00
Li Zefan	081aa458c3	cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc() These two functions share most of the code. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-20 07:50:25 -07:00
Li Zefan	3ac1707a13	cgroup: fix an off-by-one bug which may trigger BUG_ON() The 3rd parameter of flex_array_prealloc() is the number of elements, not the index of the last element. The effect of the bug is, when opening cgroup.procs, a flex array will be allocated and all elements of the array are allocated with GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to allocate memory for it, it'll trigger a BUG_ON(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-03-20 07:50:04 -07:00
Tejun Heo	14a40ffccd	sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY PF_THREAD_BOUND was originally used to mark kernel threads which were bound to a specific CPU using kthread_bind() and a task with the flag set allows cpus_allowed modifications only to itself. Workqueue is currently abusing it to prevent userland from meddling with cpus_allowed of workqueue workers. What we need is a flag to prevent userland from messing with cpus_allowed of certain kernel tasks. In kernel, anyone can (incorrectly) squash the flag, and, for worker-type usages, restricting cpus_allowed modification to the task itself doesn't provide meaningful extra protection as other tasks can inject work items to the task anyway. This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY. sched_setaffinity() checks the flag and returns -EINVAL if set. set_cpus_allowed_ptr() is no longer affected by the flag. This will allow simplifying workqueue worker CPU affinity management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2013-03-19 13:45:20 -07:00
Li Zefan	80f36c2a1a	cgroup: remove useless code in cgroup_write_event_control() eventfd_poll() never returns POLLHUP. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 15:36:00 -07:00
Li Zefan	6ee211ad0a	cgroup: don't bother to resize pid array When we open cgroup.procs, we'll allocate a buffer and store all tasks' tgid in it, and then duplicate entries will be stripped. If that results in a much smaller pid list, we'll re-allocate a smaller buffer. But we've already successfully allocated memory and reading the procs file is a short period and the memory will be freed very soon, so why bother to re-allocate memory. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 15:36:00 -07:00
Li Zefan	d7eeac1913	cgroup: hold cgroup_mutex before calling css_offline() cpuset no longer nests cgroup_mutex inside cpu_hotplug lock, so we don't have to release cgroup_mutex before calling css_offline(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 15:35:59 -07:00
Li Zefan	6dc01181ea	cgroup: remove unused variables in cgroup_destroy_locked() Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 15:35:58 -07:00
Li Zefan	e7b2dcc52b	cgroup: remove cgroup_is_descendant() It was used by ns cgroup, and ns cgroup was removed long ago. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 15:35:58 -07:00
Li Zefan	7d8e0bf56a	cgroup: avoid accessing modular cgroup subsys structure without locking subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload, and that's protected by cgroup_mutex, and then the memory *subsys[i] resides will be freed. So this is unsafe without any locking: if (!ss || ss->module) ... v2: - add a comment for enum cgroup_subsys_id - simplify the comment in cgroup_exit() Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-05 09:33:25 -08:00
Li Zefan	f50daa704f	cgroup: no need to check css refs for release notification We no longer fail rmdir() when there're still css refs, so we don't need to check css refs in check_for_release(). This also avoids a bug. cgroup_has_css_refs() accesses subsys[i] without cgroup_mutex, so it can race with cgroup_unload_subsys(). cgroup_has_css_refs() ... if (ss == NULL || ss->root != cgrp->root) if ss points to net_cls_subsys, and cls_cgroup module is unloaded right after the former check but before the latter, the memory that net_cls_subsys resides has become invalid. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-04 10:04:54 -08:00
Li Zefan	65dff759d2	cgroup: fix cgroup_path() vs rename() race rename() will change dentry->d_name. The result of this race can be worse than seeing a partially rewritten name: we might access a stale pointer because rename() will re-allocate memory to hold a longer name. As accessing dentry->d_name must be protected by dentry->d_lock or parent inode's i_mutex, while on the other hand cgroup_path() can be called with some irq-safe spinlocks held, we can't generate cgroup path using dentry->d_name. Alternatively we make a copy of dentry->d_name and save it in cgrp->name when a cgroup is created, and update cgrp->name at rename(). v5: use flexible array instead of zero-size array. v4: - allocate root_cgroup_name and all root_cgroup->name points to it. - add cgroup_name() wrapper. v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path. v2: make cgrp->name RCU safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-04 09:50:08 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... 
when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Tejun Heo	d228d9ec2c	cgroup: convert to idr_alloc() Convert to the much saner new idr interface. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:19 -08:00
Tejun Heo	c897ff68be	cgroup: don't use idr_remove_all() idr_destroy() can destroy idr by itself and idr_remove_all() is being deprecated. Drop its usage. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:14 -08:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00
Al Viro	496ad9aa8e	new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-22 23:31:31 -05:00
Li Zefan	f169007b27	cgroup: fail if monitored file and event_control are in different cgroup If we pass the fd of memory.usage_in_bytes of cgroup A to cgroup.event_control of cgroup B, then we won't get memory usage notifications from A but from B! What's worse, if A and B are in different mount hierarchies, we'll end up accessing a NULL pointer! Disallow this kind of invalid usage. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-18 09:31:35 -08:00
Li Zefan	810cbee4fa	cgroup: fix cgroup_rmdir() vs close(eventfd) race commit `205a872bd6` ("cgroup: fix lockdep warning for event_control") solved a deadlock by introducing a new bug. Moving cgrp->event_list to a temporary list doesn't mean you can traverse this list locklessly, because at the same time cgroup_event_wake() can be called and remove the event from the list. The result of this race is disastrous. We adopt the way the kvm irqfd code implements race-free event removal, which is now described in the comments in cgroup_event_wake(). v3: - call eventfd_signal() no matter whether it's eventfd close or cgroup removal that removes the cgroup event. Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-18 09:17:24 -08:00
Li Zefan	71b5707e11	cgroup: fix exit() vs rmdir() race In cgroup_exit() put_css_set_taskexit() is called without any lock, which might lead to accessing a freed cgroup: thread1 thread2 --------------------------------------------- exit() cgroup_exit() put_css_set_taskexit() atomic_dec(cgrp->count); rmdir(); /* not safe !! */ check_for_release(cgrp); rcu_read_lock() can be used to make sure the cgroup is alive. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-02-18 09:08:10 -08:00
Li Zefan	9ed8a65970	cgroup: remove bogus comments in cgroup_diput() Since commit `48ddbe1946` ("cgroup: make css->refcnt clearing on cgroup removal optional"), each css holds a ref on cgroup's dentry, so cgroup_diput() won't be called until all css' refs go down to 0, which invalidates the comments. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-24 12:05:18 -08:00
Li Zefan	be44562613	cgroup: remove synchronize_rcu() from cgroup_diput() Free cgroup via call_rcu(). The actual work is done through workqueue. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-24 12:05:18 -08:00
Li Zefan	86a3db5643	cgroup: remove duplicate RCU free on struct cgroup When destroying a cgroup, though in cgroup_diput() we've called synchronize_rcu(), we then still have to free it via call_rcu(). The story is, long ago to fix a race between reading /proc/sched_debug and freeing cgroup, the code was changed to utilize call_rcu(). See commit `a47295e6bc` ("cgroups: make cgroup_path() RCU-safe") Now that we've fixed the cpu cgroup so that cpu_cgroup_offline_css() is used to unregister a task_group, there won't be concurrent access to this task_group after synchronize_rcu() in diput(), and we can just kfree(cgrp). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-24 12:05:18 -08:00
Li Zefan	fe1c06ca75	cgroup: initialize cgrp->dentry before css_alloc() With this change, we're guaranteed that cgroup_path() won't see NULL cgrp->dentry, and thus we can remove the NULL check in it. (Well, it's not strictly true, because dummytop.dentry is always NULL but we already handle that separately.) Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-24 12:05:17 -08:00
Li Zefan	b5d646f5d5	cgroup: remove a NULL check in cgroup_exit() init_task.cgroups is initialized at boot phase, and whenever a task is forked, its cgroups pointer is inherited from its parent, and it's never set to NULL afterwards. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-24 10:40:05 -08:00
Li Zefan	2739d3cce9	cgroup: fix bogus kernel warnings when cgroup_create() failed If cgroup_create() failed and cgroup_destroy_locked() is called to do cleanup, we'll see a bunch of warnings: cgroup_addrm_files: failed to remove 2MB.limit_in_bytes, err=-2 cgroup_addrm_files: failed to remove 2MB.usage_in_bytes, err=-2 cgroup_addrm_files: failed to remove 2MB.max_usage_in_bytes, err=-2 cgroup_addrm_files: failed to remove 2MB.failcnt, err=-2 cgroup_addrm_files: failed to remove prioidx, err=-2 cgroup_addrm_files: failed to remove ifpriomap, err=-2 ... We failed to remove those files, because cgroup_create() has failed before creating those cgroup files. To fix this, we simply don't warn if cgroup_rm_file() can't find the cft entry. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-22 16:29:07 -08:00
Li Zefan	130e3695a3	cgroup: remove synchronize_rcu() from rebind_subsystems() Nothing's protected by RCU in rebind_subsystems(), and I can't think of a reason why it is needed. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-14 10:49:39 -08:00
Li Zefan	5d65bc0ca1	cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}() These 2 synchronize_rcu()s make attaching a task to a cgroup quite slow, and it can't be ignored in some situations. A real case from Colin Cross: Android uses cgroups heavily to manage thread priorities, putting threads in a background group with reduced cpu.shares when they are not visible to the user, and in a foreground group when they are. Some RPCs from foreground threads to background threads will temporarily move the background thread into the foreground group for the duration of the RPC. This results in many calls to cgroup_attach_task. In cgroup_attach_task() it's task->cgroups that is protected by RCU, and put_css_set() calls kfree_rcu() to free it. If we remove this synchronize_rcu(), there can be threads in RCU-read sections accessing their old cgroup via current->cgroups with concurrent rmdir operation, but this is safe. # time for ((i=0; i<50; i++)) { echo $$ > /mnt/sub/tasks; echo $$ > /mnt/tasks; } real 0m2.524s user 0m0.008s sys 0m0.004s With this patch: real 0m0.004s user 0m0.004s sys 0m0.000s tj: These synchronize_rcu()s are utterly confused. synchronize_rcu() necessarily has to come between two operations to guarantee that the changes made by the former operation are visible to all rcu readers before proceeding to the latter operation. Here, the synchronize_rcu()s are at the end of attach operations with nothing beyond them. Their only effect would be delaying completion of write(2) to sysfs tasks/procs files until all rcu readers see the change, which doesn't mean anything. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Colin Cross <ccross@google.com>	2013-01-14 10:44:42 -08:00
Li Zefan	0ac801fe07	cgroup: use new hashtable implementation Switch cgroup to use the new hashtable implementation. No functional changes. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-10 12:10:31 -08:00
Tejun Heo	12a9d2fef1	cgroup: implement cgroup_rightmost_descendant() Implement cgroup_rightmost_descendant() which returns the right most descendant of the specified cgroup. This can be used to skip the cgroup's subtree while iterating with cgroup_for_each_descendant_pre(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com>	2013-01-07 08:50:28 -08:00
Linus Torvalds	848b81415c	Merge branch 'akpm' (Andrew's patch-bomb) Merge misc patches from Andrew Morton: "Incoming: - lots of misc stuff - backlight tree updates - lib/ updates - Oleg's percpu-rwsem changes - checkpatch - rtc - aoe - more checkpoint/restart support I still have a pile of MM stuff pending - Pekka should be merging later today after which that is good to go. A number of other things are twiddling thumbs awaiting maintainer merges." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits) scatterlist: don't BUG when we can trivially return a proper error. docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output fs, fanotify: add @mflags field to fanotify output docs: add documentation about /proc/<pid>/fdinfo/<fd> output fs, notify: add procfs fdinfo helper fs, exportfs: add exportfs_encode_inode_fh() helper fs, exportfs: escape nil dereference if no s_export_op present fs, epoll: add procfs fdinfo helper fs, eventfd: add procfs fdinfo helper procfs: add ability to plug in auxiliary fdinfo providers tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test breakpoint selftests: print failure status instead of cause make error kcmp selftests: print fail status instead of cause make error kcmp selftests: make run_tests fix mem-hotplug selftests: print failure status instead of cause make error cpu-hotplug selftests: print failure status instead of cause make error mqueue selftests: print failure status instead of cause make error vm selftests: print failure status instead of cause make error ubifs: use prandom_bytes mtd: nandsim: use prandom_bytes ...	2012-12-17 20:58:12 -08:00
Tao Ma	8ec7d50f1e	kernel: remove reference to feature-removal-schedule.txt In commit `9c0ece069b` ("Get rid of Documentation/feature-removal.txt"), Linus removed feature-removal-schedule.txt from Documentation, but there are still some references to this file. Remove them. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:12 -08:00
Linus Torvalds	6a2b60b17b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyranny of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc/<pid>/ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc/<pid>/ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through David Miller's net tree are changes to relax many of the permission checks in the networking stack, allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small changes to add user namespace support were committed here and in David Miller's -net tree so that I could complete the work on the /proc/<pid>/ns/ files in this tree. 
Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing user namespaces from being built when any of those filesystems are enabled. Future design work remains to allow root users outside of the initial user namespace to mount more than just /proc and /sys." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits) proc: Usable inode numbers for the namespace file descriptors. proc: Fix the namespace inode permission checks. proc: Generalize proc inode allocation userns: Allow unprivileged mounts of proc and sysfs userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file procfs: Print task uids and gids in the userns that opened the proc file userns: Implement unshare of the user namespace userns: Implement proc namespace operations userns: Kill task_user_ns userns: Make create_new_namespaces take a user_ns parameter userns: Allow unprivileged use of setns. userns: Allow unprivileged users to create new namespaces userns: Allow setting a userns mapping to your current uid. userns: Allow chown and setgid preservation userns: Allow unprivileged users to create user namespaces. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: fix return value on mntns_install() failure vfs: Allow unprivileged manipulation of the mount namespace. vfs: Only support slave subtrees across different user namespaces vfs: Add a user namespace reference from struct mnt_namespace ...	2012-12-17 15:44:47 -08:00
Gao feng	f33fddc2b9	cgroup_rm_file: don't delete the uncreated files In cgroup_add_file, when creating files for a cgroup, some of the creations may be skipped. So we need to avoid deleting these uncreated files in cgroup_rm_file; otherwise the warning message will be triggered. "cgroup_addrm_files: failed to remove memory_pressure_enabled, err=-2" Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@redhat.com> Cc: stable@vger.kernel.org	2012-12-06 08:58:11 -08:00
Gao feng	7083d0378a	cgroup: remove subsystem files when remounting cgroup cgroup_clear_directory is called by cgroup_d_remove_dir and cgroup_remount. When we call cgroup_remount to remount the cgroup, the subsystem may be unlinked from cgroupfs_root->subsys_list in rebind_subsystems, so this subsystem's files will not be removed in cgroup_clear_directory, and the system will panic when we try to access these files. This patch removes the subsystems' files before rebind_subsystems; if rebind_subsystems fails, it repopulates the removed files. With help from Tejun. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-03 08:33:11 -08:00
Gao feng	879a3d9dbb	cgroup: use cgroup_addrm_files() in cgroup_clear_directory() cgroup_clear_directory() incorrectly invokes cgroup_rm_file() on each cftset of the target subsystems, which only removes the first file of each set. This leaves dangling files after subsystems are removed from a cgroup root via remount. Use cgroup_addrm_files() to remove all files of target subsystems. tj: Move cgroup_addrm_files() prototype decl upwards next to other global declarations. Commit message updated. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-30 11:44:12 -08:00
Glauber Costa	1f869e8711	cgroup: warn about broken hierarchies only after css_online If everything goes right, it shouldn't really matter if we are spitting this warning after css_alloc or css_online. If we fail in between, there are some odd cases where we would previously see the message and now we won't (for example, if the files fail to be created). I believe it really shouldn't matter: this message is intended in spirit to be shown when creation succeeds, but with insane settings. Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-30 07:11:07 -08:00
Greg Thelen	9718ceb343	cgroup: list_del_init() on removed events Use list_del_init() rather than list_del() to remove events from cgrp->event_list. No functional change. This is just defensive coding. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-28 13:52:14 -08:00
Greg Thelen	205a872bd6	cgroup: fix lockdep warning for event_control The cgroup_event_wake() function is called with the wait queue head locked and it takes cgrp->event_list_lock. However, in cgroup_rmdir() remove_wait_queue() was being called after taking cgrp->event_list_lock. Correct the lock ordering by using a temporary list to obtain the event list to remove from the wait queue. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Aaron Durbin <adurbin@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-28 13:51:56 -08:00
Li Zhong	fddfb02ad0	cgroup: move list add after list head initialization `2243076ad1` ("cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping()") initializes cgrp->allcg_node in init_cgroup_housekeeping(). Then in init_cgroup_root(), we should call init_cgroup_housekeeping() before adding it to &root->allcg_list; otherwise, we are initializing an entry already in a list. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-28 06:02:39 -08:00
Tao Ma	d0b2fdd2a5	cgroup: remove obsolete guarantee from cgroup_task_migrate. 'guarantee' is already removed from cgroup_task_migrate, so remove the corresponding comments. Some other typos in cgroup are also changed. Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-20 06:44:58 -08:00
Tejun Heo	0a950f65e1	cgroup: add cgroup->id With the introduction of generic cgroup hierarchy iterators, css_id is being phased out. It was unnecessarily complex, id'ing the wrong thing (cgroups need IDs, not CSSes) and has other oddities like not being available at ->css_alloc(). This patch adds cgroup->id, which is a simple per-hierarchy ida-allocated ID which is assigned before ->css_alloc() and released after ->css_free(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com>	2012-11-19 09:02:12 -08:00
Tejun Heo	033fa1c5f5	cgroup, cpuset: remove cgroup_subsys->post_clone() Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now that clone_children is cpuset specific, there's no reason to have this rather odd option activation mechanism in cgroup core. cpuset can check the flag from its ->css_alloc() and take the necessary action. Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and remove cgroup_subsys->post_clone(). Loosely based on Glauber's "generalize post_clone into post_create" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Glauber Costa <glommer@parallels.com> Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-19 08:13:39 -08:00
Tejun Heo	2260e7fc1f	cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/ clone_children is only meaningful for cpuset and will stay that way. Rename the flag to reflect that and update documentation. Also, drop clone_children() wrapper in cgroup.c. The thin wrapper is used only a few times and one of them will go away soon. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	92fb97487a	cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	b1929db42f	cgroup: allow ->post_create() to fail There could be cases where controllers want to do initialization operations which may fail from ->post_create(). This patch makes ->post_create() return -errno to indicate failure and online_css() relay such failures. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	4b8b47eb00	cgroup: update cgroup_create() failure path cgroup_create() was ignoring failure of cgroupfs files. Update it such that, if file creation fails, it rolls back by calling cgroup_destroy_locked() and returns failure. Note that error out goto labels are renamed. The labels are a bit confusing but will become better w/ later cgroup operation renames. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	b8a2df6a5b	cgroup: use mutex_trylock() when grabbing i_mutex of a new cgroup directory All cgroup directory i_mutexes nest outside cgroup_mutex; however, new directory creation is a special case. A new cgroup directory is created while holding cgroup_mutex. Populating the new directory requires both the new directory's i_mutex and cgroup_mutex. Because all directory i_mutexes nest outside cgroup_mutex, grabbing both requires releasing cgroup_mutex first, which isn't a good idea as the new cgroup isn't yet ready to be manipulated by other cgroup operations. This is worked around by grabbing the new directory's i_mutex while holding cgroup_mutex before making it visible. As there's no other user at that point, grabbing the i_mutex under cgroup_mutex can't lead to deadlock. cgroup_create_file() was using I_MUTEX_CHILD to tell lockdep not to worry about the reverse locking order; however, this creates pseudo locking dependency cgroup_mutex -> I_MUTEX_CHILD, which isn't true - all directory i_mutexes are still nested outside cgroup_mutex. This pseudo locking dependency can lead to spurious lockdep warnings. Use mutex_trylock() instead. This will always succeed and lockdep doesn't create any locking dependency for it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	d19e19de48	cgroup: simplify cgroup_load_subsys() failure path Now that cgroup_unload_subsys() can tell whether the root css is online or not, we can safely call cgroup_unload_subsys() after idr init failure in cgroup_load_subsys(). Replace the manual unrolling and invoke cgroup_unload_subsys() on failure. This drops cgroup_mutex in between but should be safe as the subsystem will fail try_module_get() and thus can't be mounted in between. As this means that cgroup_unload_subsys() can be called before css_sets are rehashed, remove BUG_ON() on %NULL css_set->subsys[] from cgroup_unload_subsys(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	a31f2d3ff7	cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers New helpers on/offline_css() respectively wrap ->post_create() and ->pre_destroy() invocations. online_css() sets CSS_ONLINE after ->post_create() is complete and offline_css() invokes ->pre_destroy() iff CSS_ONLINE is set and clears it while also handling the temporary dropping of cgroup_mutex. This patch doesn't introduce any behavior change at the moment but will be used to improve cgroup_create() failure path and allow ->post_create() to fail. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	42809dd422	cgroup: separate out cgroup_destroy_locked() Separate out cgroup_destroy_locked() from cgroup_destroy(). This will be later used in cgroup_create() failure path. While at it, add lockdep asserts on i_mutex and cgroup_mutex, and move @d and @parent assignments to their declarations. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	02ae7486d0	cgroup: fix harmless bugs in cgroup_load_subsys() fail path and cgroup_unload_subsys() * If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[] before calling ->destroy() making CSS inaccessible to the callback, and didn't unlink ss->sibling. As no modular controller uses ->use_id, this doesn't cause any actual problems. * cgroup_unload_subsys() was forgetting to free idr, call ->pre_destroy() and clear ->active. As there currently is no modular controller which uses ->use_id, ->pre_destroy() or ->active, this doesn't cause any actual problems. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	648bb56d07	cgroup: lock cgroup_mutex in cgroup_init_subsys() Make cgroup_init_subsys() grab cgroup_mutex while initializing a subsystem so that all helpers and callbacks are called under the context they expect. This isn't strictly necessary as cgroup_init_subsys() doesn't race with anybody but will allow adding lockdep assertions. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	b48c6a80a0	cgroup: trivial cleanup for cgroup_init/load_subsys() Consistently use @css and @dummytop in these two functions instead of referring to them indirectly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	38b53abaa3	cgroup: make CSS_* flags bit masks instead of bit positions Currently, CSS_* flags are defined as bit positions and manipulated using atomic bitops. There's no reason to use atomic bitops for them and bit positions are clunkier to deal with than bit masks. Make CSS_* bit masks instead and use the usual C bitwise operators to access them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	febfcef60d	cgroup: cgroup->dentry isn't a RCU pointer cgroup->dentry is marked and used as a RCU pointer; however, it isn't one - the final dentry put doesn't go through call_rcu(). cgroup and dentry share the same RCU freeing rule via synchronize_rcu() in cgroup_diput() (kfree_rcu() used on cgrp is unnecessary). If cgrp is accessible under RCU read lock, so is its dentry and dereferencing cgrp->dentry doesn't need any further RCU protection or annotation. While not being accurate, before the previous patch, the RCU accessors served a purpose as memory barriers - cgroup->dentry used to be assigned after the cgroup was made visible to cgroup_path(), so the assignment and dereferencing in cgroup_path() needed the memory barrier pair. Now that list_add_tail_rcu() happens after cgroup->dentry is assigned, this no longer is necessary. Remove the now unnecessary and misleading RCU annotations from cgroup->dentry. To make up for the removal of rcu_dereference_check() in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts the dereference rule of @cgrp, not cgrp->dentry. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	4e139afc22	cgroup: create directory before linking while creating a new cgroup While creating a new cgroup, cgroup_create() links the newly allocated cgroup into various places before trying to create its directory. Because cgroup life-cycle is tied to the vfs objects, this makes it impossible to use cgroup_rmdir() for rolling back creation - the removal logic depends on having full vfs objects. This patch moves directory creation above linking and collect linking operations to one place. This allows directory creation failure to share error exit path with css allocation failures and any failure sites afterwards (to be added later) can use cgroup_rmdir() logic to undo creation. Note that this also makes the memory barriers around cgroup->dentry, which currently is misleadingly using RCU operations, unnecessary. This will be handled in the next patch. While at it, locking BUG_ON() on i_mutex is converted to lockdep_assert_held(). v2: Patch originally removed %NULL dentry check in cgroup_path(); however, Li pointed out that this patch doesn't make it unnecessary as ->create() may call cgroup_path(). Drop the change for now. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	28fd6f30ac	cgroup: open-code cgroup_create_dir() The operation order of cgroup creation is about to change and cgroup_create_dir() is more of a hindrance than a proper abstraction. Open-code it by moving the parent nlink adjustment next to self nlink adjustment in cgroup_create_file() and the rest to cgroup_create(). This patch doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	2243076ad1	cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping() Not strictly necessary but it's annoying to have uninitialized list_head around. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:35 -08:00
Tejun Heo	175431635e	cgroup: remove incorrect dget/dput() pair in cgroup_create_dir() cgroup_create_dir() does weird dancing with dentry refcnt. On success, it gets and then puts it achieving nothing. On failure, it puts but there is no matching get anywhere leading to the following oops if cgroup_create_file() fails for whatever reason. ------------[ cut here ]------------ kernel BUG at /work/os/work/fs/dcache.c:552! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU 2 Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff811d9c0c>] [<ffffffff811d9c0c>] dput+0x1dc/0x1e0 RSP: 0018:ffff88001a3ebef8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403 RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58 RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001 R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60 FS: 00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080) Stack: ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000 ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8 ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8 Call Trace: [<ffffffff811cc889>] done_path_create+0x19/0x50 [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80 [<ffffffff811d2009>] sys_mkdir+0x19/0x20 [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41 RIP [<ffffffff811d9c0c>] dput+0x1dc/0x1e0 RSP <ffff88001a3ebef8> ---[ end trace 1277bcfd9561ddb0 ]--- Fix it by dropping the unnecessary dget/dput() pair. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2012-11-19 08:13:35 -08:00
Eric W. Biederman	17cf22c33e	pidns: Use task_active_pid_ns where appropriate The expressions tsk->nsproxy->pid_ns and task_active_pid_ns aka ns_of_pid(task_pid(tsk)) should have the same number of cache line misses with the practical difference that ns_of_pid(task_pid(tsk)) is released later in a process's life. Furthermore by using task_active_pid_ns it becomes trivial to write an unshare implementation for the pid namespace. So I have used task_active_pid_ns everywhere I can. In fork since the pid has not yet been attached to the process I use ns_of_pid, to achieve the same effect. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:09 -08:00
Tejun Heo	574bd9f7c7	cgroup: implement generic child / descendant walk macros Currently, cgroup doesn't provide any generic helper for walking a given cgroup's children or descendants. This patch adds the following three macros. * cgroup_for_each_child() - walk immediate children of a cgroup. * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup in pre-order tree traversal. * cgroup_for_each_descendant_post() - visit all descendants of a cgroup in post-order tree traversal. All three only require the user to hold RCU read lock during traversal. Verifying that each iterated cgroup is online is the responsibility of the user. When used with proper synchronization, cgroup_for_each_descendant_pre() can be used to propagate state updates to descendants in a reliable way. See comments for details. v2: s/config/state/ in commit message and comments per Michal. More documentation on synchronization rules. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-09 09:12:29 -08:00
Tejun Heo	eb6fd5040e	cgroup: use rculist ops for cgroup->children Use RCU safe list operations for cgroup->children. This will be used to implement cgroup children / descendant walking which can be used by controllers. Note that cgroup_create() now puts a new cgroup at the end of the ->children list instead of head. This isn't strictly necessary but is done so that the iteration order is more conventional. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-09 09:12:29 -08:00
Tejun Heo	a8638030f6	cgroup: add cgroup_subsys->post_create() Currently, there's no way for a controller to find out whether a new cgroup finished all ->create() allocations successfully and is considered "live" by cgroup. This becomes a problem later when we add generic descendants walking to cgroup which can be used by controllers as controllers don't have a synchronization point where they can synchronize against new cgroups appearing in such walks. This patch adds ->post_create(). It's called after all ->create() succeeded and the cgroup is linked into the generic cgroup hierarchy. This plays the counterpart of ->pre_destroy(). When used in combination with the to-be-added generic descendant iterators, ->post_create() can be used to implement reliable state inheritance. It will be explained with the descendant iterators. v2: Added a paragraph about its future use w/ descendant iterators per Michal. v3: Forgot to add ->post_create() invocation to cgroup_load_subsys(). Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-09 09:12:29 -08:00
Tao Ma	316eb661f1	cgroup: set 'start' with the right value in cgroup_path. 'start' is set to buf + buflen and then immediately decremented with '--'. Just set it to 'buf + buflen - 1' directly. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com>	2012-11-08 06:23:02 -08:00
Tejun Heo	5b805f2a76	Merge branch 'cgroup/for-3.7-fixes' into cgroup/for-3.8 This is to receive device_cgroup fixes so that further device_cgroup changes can be made in cgroup/for-3.8. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-06 12:26:23 -08:00
Tejun Heo	1db1e31b1e	Merge branch 'cgroup-rmdir-updates' into cgroup/for-3.8 Pull rmdir updates into for-3.8 so that further callback updates can be put on top. This pull created a trivial conflict between the following two commits. `8c7f6edbda` ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them") `ed95779340` ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs") The former added a field to cgroup_subsys and the latter removed one from it. They happen to be colocated causing the conflict. Keeping what's added and removing what's removed resolves the conflict. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-05 09:21:51 -08:00
Tejun Heo	bcf6de1b91	cgroup: make ->pre_destroy() return void All ->pre_destroy() implementations return 0 now, which is the only allowed return value. Make it return void. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com>	2012-11-05 09:16:59 -08:00
Tejun Heo	b25ed609d0	cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup destruction rollback somewhat working. cgroup_rmdir() used to drain CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and helpers were used to allow the task performing rmdir to wait for the next relevant event. Unfortunately, the wait is visible to controllers too and the mechanism got exposed to memcg by `887032670d` ("cgroup avoid permanent sleep at rmdir"). Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is unnecessary. Remove it and all the mechanisms supporting it. Note that memcontrol.c changes are essentially revert of `887032670d` ("cgroup avoid permanent sleep at rmdir"). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com>	2012-11-05 09:16:59 -08:00
Tejun Heo	1a90dd508b	cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy() Because ->pre_destroy() could fail and can't be called under cgroup_mutex, cgroup destruction did something very ugly. 1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise. 2. Release cgroup_mutex and call ->pre_destroy(). 3. Re-grab cgroup_mutex and verify it can still be destroyed; fail otherwise. 4. Continue destroying. In addition to being ugly, it has been always broken in various ways. For example, memcg ->pre_destroy() expects the cgroup to be inactive after it's done but tasks can be attached and detached between #2 and #3 and the conditions that memcg verified in ->pre_destroy() might no longer hold by the time control reaches #3. Now that ->pre_destroy() is no longer allowed to fail, we can switch to the following. 1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise. 2. Deactivate CSS's and mark the cgroup removed thus preventing any further operations which can invalidate the verification from #1. 3. Release cgroup_mutex and call ->pre_destroy(). 4. Re-grab cgroup_mutex and continue destroying. After this change, controllers can safely assume that ->pre_destroy() will be called only once for a given cgroup and, once ->pre_destroy() is called, the cgroup will stay dormant till it's destroyed. This removes the only reason ->pre_destroy() can fail - new task being attached or child cgroup being created in between. Error out path is removed and ->pre_destroy() invocation is open coded in cgroup_rmdir(). v2: cgroup_call_pre_destroy() removal moved to this patch per Michal. Commit message updated per Glauber. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-05 09:16:59 -08:00
Tejun Heo	976c06bccc	cgroup: use cgroup_lock_live_group(parent) in cgroup_create() This patch makes cgroup_create() fail if @parent is marked removed. This is to prepare for further updates to cgroup_rmdir() path. Note that this change isn't strictly necessary. cgroup can only be created via mkdir and the removed marking and dentry removal happen without releasing cgroup_mutex, so cgroup_create() can never race with cgroup_rmdir(). Even after the scheduled updates to cgroup_rmdir(), cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex rendering the added liveliness check unnecessary. Do it anyway such that locking is contained inside cgroup proper and we don't get nasty surprises if we ever grow another caller of cgroup_create(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-05 09:16:59 -08:00
Tejun Heo	e93160803f	cgroup: kill CSS_REMOVED CSS_REMOVED is one of the several contortions which were necessary to support css reference draining on cgroup removal. All css->refcnts which need draining should be deactivated and verified to equal zero atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts needed to be re-activated and css_tryget() shouldn't fail in the process. This was achieved by letting css_tryget() busy-loop until either the refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set (committing to removal). Now that css refcnt draining is no longer used, there's no need for atomic rollback mechanism. css_tryget() simply can look at the reference count and fail if it's deactivated - it's never getting re-activated. This patch removes CSS_REMOVED and updates __css_tryget() to fail if the refcnt is deactivated. As deactivation and removal are a single step now, they no longer need to be protected against css_tryget() happening from irq context. Remove local_irq_disable/enable() from cgroup_rmdir(). Note that this removes css_is_removed() whose only user is VM_BUG_ON() in memcontrol.c. We can replace it with a check on the refcnt but given that the only use case is a debug assert, I think it's better to simply unexport it. v2: Comment updated and explanation on local_irq_disable/enable() added per Michal Hocko. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>	2012-11-05 09:16:58 -08:00
Tejun Heo	ed95779340	cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs `2ef37d3fe4` ("memcg: Simplify mem_cgroup_force_empty_list error handling") removed the last user of __DEPRECATED_clear_css_refs. This patch removes __DEPRECATED_clear_css_refs and mechanisms to support it. * Conditionals dependent on __DEPRECATED_clear_css_refs removed. * cgroup_clear_css_refs() can no longer fail. All that needs to be done are deactivating refcnts, setting CSS_REMOVED and putting the base reference on each css. Remove cgroup_clear_css_refs() and the failure path, and open-code the loops into cgroup_rmdir(). This patch keeps the two for_each_subsys() loops separate while open coding them. They can be merged now but there are scheduled changes which need them to be separate, so keep them separate to reduce the amount of churn. local_irq_save/restore() from cgroup_clear_css_refs() are replaced with local_irq_disable/enable() for simplicity. This is safe as cgroup_rmdir() is always called with IRQ enabled. Note that this IRQ switching is necessary to ensure that css_tryget() isn't called from IRQ context on the same CPU while lower context is between CSS deactivation and setting CSS_REMOVED as css_tryget() would hang forever in such cases waiting for CSS to be re-activated or CSS_REMOVED set. This will go away soon. v2: cgroup_call_pre_destroy() removal dropped per Michal. Commit message updated to explain local_irq_disable/enable() conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-05 09:16:58 -08:00
Tejun Heo	d878383211	Revert "cgroup: Remove task_lock() from cgroup_post_fork()" This reverts commit `7e3aa30ac8`. The commit incorrectly assumed that fork path always performed threadgroup_change_begin/end() and depended on that for synchronization against task exit and cgroup migration paths instead of explicitly grabbing task_lock(). threadgroup_change is not locked when forking a new process (as opposed to a new thread in the same process) and even if it were it wouldn't be effective as different processes use different threadgroup locks. Revert the incorrect optimization. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20121008020000.GB2575@localhost> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org	2012-10-19 14:09:35 -07:00
Tejun Heo	9bb71308b8	Revert "cgroup: Drop task_lock(parent) on cgroup_fork()" This reverts commit `7e381b0eb1`. The commit incorrectly assumed that fork path always performed threadgroup_change_begin/end() and depended on that for synchronization against task exit and cgroup migration paths instead of explicitly grabbing task_lock(). threadgroup_change is not locked when forking a new process (as opposed to a new thread in the same process) and even if it were it wouldn't be effective as different processes use different threadgroup locks. Revert the incorrect optimization. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20121008020000.GB2575@localhost> Acked-by: Li Zefan <lizefan@huawei.com> Bitterly-Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org	2012-10-19 14:08:49 -07:00
Daisuke Nishimura	1f5320d597	cgroup: notify_on_release may not be triggered in some cases notify_on_release must be triggered when the last process in a cgroup is moved to another. But if the first (and only) process in a cgroup is moved to another, notify_on_release is not triggered. # mkdir /cgroup/cpu/SRC # mkdir /cgroup/cpu/DST # # echo 1 >/cgroup/cpu/SRC/notify_on_release # echo 1 >/cgroup/cpu/DST/notify_on_release # # sleep 300 & [1] 8629 # # echo 8629 >/cgroup/cpu/SRC/tasks # echo 8629 >/cgroup/cpu/DST/tasks -> notify_on_release for /SRC must be triggered at this point, but it isn't. This is because put_css_set() is called before setting CGRP_RELEASABLE in cgroup_task_migrate(), and is a regression introduced by the commit 74a1166d (cgroups: make procs file writable), which was merged into v3.0. Cc: Ben Blum <bblum@andrew.cmu.edu> Cc: <stable@vger.kernel.org> # v3.0.x and later Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-10-16 17:09:36 -07:00

432 Commits