linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	47dfe4037e	Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers. - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented. - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface. - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down. All the changes are to the new interface and no behavior changed for the multiple hierarchies" * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits) cpuset: fix the WARN_ON() in update_nodemasks_hier() cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup: distinguish the default and legacy hierarchies when handling cftypes cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes cgroup: split cgroup_base_files[] into cgroup_{dfl\|legacy}_base_files[] cpuset: export effective masks to userspace cpuset: allow writing offlined masks to cpuset.cpus/mems cpuset: enable onlined cpu/node in effective masks cpuset: refactor cpuset_hotplug_update_tasks() cpuset: make cs->{cpus, mems}_allowed as user-configured masks cpuset: apply cs->effective_{cpus,mems} cpuset: initialize top_cpuset's configured masks at mount cpuset: use effective cpumask to build sched domains cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty cpuset: update cs->effective_{cpus, mems} when config changes cpuset: update cpuset->effective_{cpus,mems} at hotplug cpuset: add cs->effective_cpus and cs->effective_mems cgroup: clean up sane_behavior handling ...	2014-08-04 10:11:28 -07:00
Linus Torvalds	f2a84170ed	Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: - Major reorganization of percpu header files which I think makes things a lot more readable and logical than before. - percpu-refcount is updated so that it requires explicit destruction and can be reinitialized if necessary. This was pulled into the block tree to replace the custom percpu refcnting implemented in blk-mq. - In the process, percpu and percpu-refcount got cleaned up a bit * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits) percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero() percpu-refcount: require percpu_ref to be exited explicitly percpu-refcount: use unsigned long for pcpu_count pointer percpu-refcount: add helpers for ->percpu_count accesses percpu-refcount: one bit is enough for REF_STATUS percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() workqueue: stronger test in process_one_work() workqueue: clear POOL_DISASSOCIATED in rebind_workers() percpu: Use ALIGN macro instead of hand coding alignment calculation percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations percpu: preffity percpu header files percpu: use raw_cpu_() to define __this_cpu_() percpu: reorder macros in percpu header files percpu: move {raw\|this}_cpu_() definitions to include/linux/percpu-defs.h percpu: move generic {raw\|this}_cpu__N() definitions to include/asm-generic/percpu.h percpu: only allow sized arch overrides for {raw\|this}_cpu_*() ops percpu: reorganize include/linux/percpu-defs.h percpu: move accessors from include/linux/percpu.h to percpu-defs.h percpu: include/asm-generic/percpu.h should contain only arch-overridable parts percpu: introduce arch_raw_cpu_ptr() ...	2014-08-04 10:09:27 -07:00
Tejun Heo	5de4fa13c4	cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgrp_dfl_root_inhibit_ss_mask determines which subsystems are not supported on the default hierarchy and is currently initialized statically and just includes the debug subsystem. Now that there's cgroup_subsys->dfl_files, we can easily tell which subsystems support the default hierarchy or not. Let's initialize cgrp_dfl_root_inhibit_ss_mask by testing whether cgroup_subsys->dfl_files is NULL. After all, subsystems with NULL ->dfl_files aren't useable on the default hierarchy anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-15 11:05:10 -04:00
Tejun Heo	05ebb6e60f	cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup now distinguishes cftypes for the default and legacy hierarchies more explicitly by using separate arrays and CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE should be and are used only inside cgroup core proper. Let's make it clear that the flags are internal by prefixing them with double underscores. CFTYPE_INSANE is renamed to __CFTYPE_NOT_ON_DFL for consistency. The two flags are also collected and assigned bits >= 16 so that they aren't mixed with the published flags. v2: Convert the extra ones in cgroup_exit_cftypes() which are added by revision to the previous patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-15 11:05:10 -04:00
Tejun Heo	a8ddc8215e	cgroup: distinguish the default and legacy hierarchies when handling cftypes Until now, cftype arrays carried files for both the default and legacy hierarchies and the files which needed to be used on only one of them were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This gets confusing very quickly and we may end up exposing interface files to the default hierarchy without thinking it through. This patch makes cgroup core provide separate sets of interfaces for cftype handling so that the cftypes for the default and legacy hierarchies are clearly distinguished. The previous two patches renamed the existing ones so that they clearly indicate that they're for the legacy hierarchies. This patch adds the interface for the default hierarchy and apply them selectively depending on the hierarchy type. * cftypes added through cgroup_subsys->dfl_cftypes and cgroup_add_dfl_cftypes() only show up on the default hierarchy. * cftypes added through cgroup_subsys->legacy_cftypes and cgroup_add_legacy_cftypes() only show up on the legacy hierarchies. * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the same array for the cases where the interface files are identical on both types of hierarchies. * This makes all the existing subsystem interface files legacy-only by default and all subsystems will have no interface file created when enabled on the default hierarchy. Each subsystem should explicitly review and compose the interface for the default hierarchy. * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which makes subsystems which haven't decided the interface files for the default hierarchy to present the legacy files on the default hierarchy so that its behavior on the default hierarchy can be tested. As the awkward name suggests, this is for development only. * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole array isn't used on the default hierarchy. The flag is removed. v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl. v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:10 -04:00
Tejun Heo	2cf669a58d	cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() Currently, cftypes added by cgroup_add_cftypes() are used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype addition functions and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch adds cgroup_add_legacy_cftypes() which currently is a simple wrapper around cgroup_add_cftypes() and replaces all cgroup_add_cftypes() usages with it. While at it, this patch drops a completely spurious return from __hugetlb_cgroup_file_init(). This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	5577964e64	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	a14c6874be	cgroup: split cgroup_base_files[] into cgroup_{dfl\|legacy}_base_files[] Currently cgroup_base_files[] contains the cgroup core interface files for both legacy and default hierarchies with each file tagged with CFTYPE_INSANE and CFTYPE_ONLY_ON_DFL. This is difficult to read. Let's separate it out to two separate tables, cgroup_dfl_base_files[] and cgroup_legacy_base_files[], and use the appropriate one in cgroup_mkdir() depending on the hierarchy type. This makes tagging each file unnecessary. This patch doesn't introduce any behavior changes. v2: cgroup_dfl_base_files[] was missing the termination entry triggering WARN in cgroup_init_cftypes() for 0day kernel testing robot. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jet Chen <jet.chen@intel.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	7b9a6ba56e	cgroup: clean up sane_behavior handling After the previous patch to remove sane_behavior support from non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to indicate the default hierarchy while parsing mount options. This patch makes the following cleanups around it. * Don't show it in the mount option. Eventually the default hierarchy will be assigned a different filesystem type. * As sane_behavior is no longer effective on non-default hierarchies and the default hierarchy doesn't accept any mount options, parse_cgroupfs_options() can consider sane_behavior mount option as indicating the default hierarchy and fail if any other options are specified with it. While at it, remove one of the double blank lines in the function. * cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell whether to mount the default hierarchy or not. * As CGROUP_ROOT_SANE_BEHAVIOR's only role now is indicating whether to select the default hierarchy or not during mount, it doesn't need to be set in the default hierarchy itself. cgroup_init_early() updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-09 10:08:08 -04:00
Tejun Heo	aa6ec29bee	cgroup: remove sane_behavior support on non-default hierarchies sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>	2014-07-09 10:08:08 -04:00
Tejun Heo	c1d5d42efd	cgroup: make interface file "cgroup.sane_behavior" legacy-only "cgroup.sane_behavior" is added to help distinguishing whether sane_behavior is in effect or not. We now have the default hierarchy where the flag is always in effect and are planning to remove supporting sane behavior on the legacy hierarchies making this file on the default hierarchy rather pointless. Let's make it legacy only and thus always zero. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-09 10:08:08 -04:00
Tejun Heo	7450e90bbb	cgroup: remove CGRP_ROOT_OPTION_MASK cgroup_root->flags only contains CGRP_ROOT_* flags and there's no reason to mask the flags. Remove CGRP_ROOT_OPTION_MASK. This doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-09 10:08:07 -04:00
Tejun Heo	af0ba6789c	cgroup: implement cgroup_subsys->depends_on Currently, the blkio subsystem attributes all of writeback IOs to the root. One of the issues is that there's no way to tell who originated a writeback IO from block layer. Those IOs are usually issued asynchronously from a task which didn't have anything to do with actually generating the dirty pages. The memory subsystem, when enabled, already keeps track of the ownership of each dirty page and it's desirable for blkio to piggyback instead of adding its own per-page tag. blkio piggybacking on memory is an implementation detail which preferably should be handled automatically without requiring explicit userland action. To achieve that, this patch implements cgroup_subsys->depends_on which contains the mask of subsystems which should be enabled together when the subsystem is enabled. The previous patches already implemented the support for enabled but invisible subsystems and cgroup_subsys->depends_on can be easily implemented by updating cgroup_refresh_child_subsys_mask() so that it calculates cgroup->child_subsys_mask considering cgroup_subsys->depends_on of the explicitly enabled subsystems. Documentation/cgroups/unified-hierarchy.txt is updated to explain that subsystems may not become immediately available after being unused from userland and that dependency could be a factor in it. As subsystems may already keep residual references, this doesn't significantly change how subsystem rebinding can be used. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:57 -04:00
Tejun Heo	b4536f0cab	cgroup: implement cgroup_subsys->css_reset() cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". The previous patches added support for explicitly and implicitly enabled subsystems and showing/hiding their interface files. An explicitly enabled subsystem may become implicitly enabled if it's turned off through "cgroup.subtree_control" but there are subsystems depending on it. In such cases, the subsystem, as it's turned off when seen from userland, shouldn't enforce any resource control. Also, the subsystem may be explicitly turned on later again and its interface files should be as close to the intial state as possible. This patch adds cgroup_subsys->css_reset() which is invoked when a css is hidden. The callback should disable resource control and reset the state to the vanilla state. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:57 -04:00
Tejun Heo	f63070d350	cgroup: make interface files visible iff enabled on cgroup->subtree_control cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". The preceding patch distinguished cgroup->subtree_control and ->child_subsys_mask where the former is the subsystems explicitly configured by the userland and the latter is all enabled subsystems currently is equal to the former but will include subsystems implicitly enabled through dependency. Subsystems which are enabled due to dependency shouldn't be visible to userland. This patch updates cgroup_subtree_control_write() and create_css() such that interface files are not created for implicitly enabled subsytems. * @visible paramter is added to create_css(). Interface files are created only when true. * If an already implicitly enabled subsystem is turned on through "cgroup.subtree_control", the existing css should be used. css draining is skipped. * cgroup_subtree_control_write() computes the new target cgroup->child_subsys_mask and create/kill or show/hide csses accordingly. As the two subsystem masks are still kept identical, this patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:57 -04:00
Tejun Heo	667c249171	cgroup: introduce cgroup->subtree_control cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". Previously, cgroup->child_subsys_mask directly reflected "cgroup.subtree_control" and the enabled subsystems in the child cgroups. This patch adds cgroup->subtree_control which "cgroup.subtree_control" operates on. cgroup->child_subsys_mask is now calculated from cgroup->subtree_control by cgroup_refresh_child_subsys_mask(), which sets it identical to cgroup->subtree_control for now. This will allow using cgroup->child_subsys_mask for all the enabled subsystems including the implicit ones and ->subtree_control for tracking the explicitly requested ones. This patch keeps the two masks identical and doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:56 -04:00
Tejun Heo	c29adf24e0	cgroup: reorganize cgroup_subtree_control_write() Make the following two reorganizations to cgroup_subtree_control_write(). These are to prepare for future changes and shouldn't cause any functional difference. * Move availability above css offlining wait. * Move cgrp->child_subsys_mask update above new css creation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:56 -04:00
Li Zefan	3a32bd72d7	cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() We've converted cgroup to kernfs so cgroup won't be intertwined with vfs objects and locking, but there are dark areas. Run two instances of this script concurrently: for ((; ;)) { mount -t cgroup -o cpuacct xxx /cgroup umount /cgroup } After a while, I saw two mount processes were stuck at retrying, because they were waiting for a subsystem to become free, but the root associated with this subsystem never got freed. This can happen, if thread A is in the process of killing superblock but hasn't called percpu_ref_kill(), and at this time thread B is mounting the same cgroup root and finds the root in the root list and performs percpu_ref_try_get(). To fix this, we try to increase both the refcnt of the superblock and the percpu refcnt of cgroup root. v2: - we should try to get both the superblock refcnt and cgroup_root refcnt, because cgroup_root may have no superblock assosiated with it. - adjust/add comments. tj: Updated comments. Renamed @sb to @pinned_sb. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-30 10:16:26 -04:00
Li Zefan	970317aa48	cgroup: fix mount failure in a corner case # cat test.sh #! /bin/bash mount -t cgroup -o cpu xxx /cgroup umount /cgroup mount -t cgroup -o cpu,cpuacct xxx /cgroup umount /cgroup # ./test.sh mount: xxx already mounted or /cgroup busy mount: according to mtab, xxx is already mounted on /cgroup It's because the cgroupfs_root of the first mount was under destruction asynchronously. Fix this by delaying and then retrying mount for this case. v3: - put the refcnt immediately after getting it. (Tejun) v2: - use percpu_ref_tryget_live() rather that introducing percpu_ref_alive(). (Tejun) - adjust comment. tj: Updated the comment a bit. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-30 10:16:25 -04:00
Tejun Heo	9a1049da9b	percpu-refcount: require percpu_ref to be exited explicitly Currently, a percpu_ref undoes percpu_ref_init() automatically by freeing the allocated percpu area when the percpu_ref is killed. While seemingly convenient, this has the following niggles. * It's impossible to re-init a released reference counter without going through re-allocation. * In the similar vein, it's impossible to initialize a percpu_ref count with static percpu variables. * We need and have an explicit destructor anyway for failure paths - percpu_ref_cancel_init(). This patch removes the automatic percpu counter freeing in percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a generic destructor now named percpu_ref_exit(). percpu_ref_destroy() is considered but it gets confusing with percpu_ref_kill() while "exit" clearly indicates that it's the counterpart of percpu_ref_init(). All percpu_ref_cancel_init() users are updated to invoke percpu_ref_exit() instead and explicit percpu_ref_exit() calls are added to the destruction path of all percpu_ref users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Li Zefan <lizefan@huawei.com>	2014-06-28 08:10:14 -04:00
Li Zefan	99bae5f941	cgroup: fix broken css_has_online_children() After running: # mount -t cgroup cpu xxx /cgroup && mkdir /cgroup/sub && \ rmdir /cgroup/sub && umount /cgroup I found the cgroup root still existed: # cat /proc/cgroups #subsys_name hierarchy num_cgroups enabled cpuset 0 1 1 cpu 1 1 1 ... It turned out css_has_online_children() is broken. Signed-off-by: Li Zefan <lizefan@huawei.com> Sigend-off-by: Tejun Heo <tj@kernel.org>	2014-06-17 18:52:53 -04:00
Linus Torvalds	14208b0ec5	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...	2014-06-09 15:03:33 -07:00
Li Zefan	c731ae1d0f	cgroup: disallow disabled controllers on the default hierarchy After booting with cgroup_disable=memory, I still saw memcg files in the default hierarchy, and I can write to them, though it won't take effect. # dmesg ... Disabling memory control group subsystem ... # mount -t cgroup -o __DEVEL__sane_behavior xxx /cgroup # ls /cgroup ... memory.failcnt memory.move_charge_at_immigrate memory.force_empty memory.numa_stat memory.limit_in_bytes memory.oom_control ... # cat /cgroup/memory.usage_in_bytes 0 tj: Minor comment update. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-05 09:52:51 -04:00
Li Zefan	1f779fb28a	cgroup: don't destroy the default root The default root is allocated and initialized at boot phase, so we shouldn't destroy the default root when it's umounted, otherwise it will lead to disaster. Just try mount and then umount the default root, and the kernel will crash immediately. v2: - No need to check for CSS_NO_REF in cgroup_get/put(). (Tejun) - Better call cgroup_put() for the default root in kill_sb(). (Tejun) - Add a comment. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-04 21:21:51 -04:00
Jianyu Zhan	c9482a5bdc	kernfs: move the last knowledge of sysfs out from kernfs There is still one residue of sysfs remaining: the sb_magic SYSFS_MAGIC. However this should be kernfs user specific, so this patch moves it out. Kerrnfs user should specify their magic number while mouting. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-03 08:11:18 -07:00
Tejun Heo	5533e01144	cgroup: disallow debug controller on the default hierarchy The debug controller, as its name suggests, exposes cgroup core internals to userland to aid debugging. Unfortunately, except for the name, there's no provision to prevent its usage in production configurations and the controller is widely enabled and mounted leaking internal details to userland. Like most other debug information, the information exposed by debug isn't interesting even for debugging itself once the related parts are working reliably. This controller has no reason for existing. This patch implements cgrp_dfl_root_inhibit_ss_mask which can suppress specific subsystems on the default hierarchy and adds the debug subsystem to it so that it can be gradually deprecated as usages move towards the unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-19 16:37:06 -04:00
Tejun Heo	f3d4650015	cgroup: convert cgroup_has_live_children() into css_has_online_children() Now that cgroup liveliness and css onliness are the same state, convert cgroup_has_live_children() into css_has_online_children() so that it can be used for actual csses too. The function now uses css_for_each_child() for iteration and is published. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:52 -04:00
Tejun Heo	184faf3232	cgroup: use CSS_ONLINE instead of CGRP_DEAD Use CSS_ONLINE on the self css to indicate whether a cgroup has been killed instead of CGRP_DEAD. This will allow re-using css online test for cgroup liveliness test. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:51 -04:00
Tejun Heo	c2931b70a3	cgroup: iterate cgroup_subsys_states directly Currently, css_next_child() is implemented as finding the next child cgroup which has the css enabled, which used to be the only way to do it as only cgroups participated in sibling lists and thus could be iteratd. This works as long as what's required during iteration is not missing online csses; however, it turns out that there are use cases where offlined but not yet released csses need to be iterated. This is difficult to implement through cgroup iteration the unified hierarchy as there may be multiple dying csses for the same subsystem associated with single cgroup. After the recent changes, the cgroup self and regular csses behave identically in how they're linked and unlinked from the sibling lists including assertion of CSS_RELEASED and css_next_child() can simply switch to iterating csses directly. This both simplifies the logic and ensures that all visible non-released csses are included in the iteration whether there are multiple dying csses for a subsystem or not. As all other iterators depend on css_next_child() for sibling iteration, this changes behaviors of all css iterators. Add and update explanations on the css states which are included in traversal to all iterators. As css iteration could always contain offlined csses, this shouldn't break any of the current users and new usages which need iteration of all on and offline csses can make use of the new semantics. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:51 -04:00
Tejun Heo	de3f034182	cgroup: introduce CSS_RELEASED and reduce css iteration fallback window css iterations allow the caller to drop RCU read lock. As long as the caller keeps the current position accessible, it can simply re-grab RCU read lock later and continue iteration. This is achieved by using CGRP_DEAD to detect whether the current positions next pointer is safe to dereference and if not re-iterate from the beginning to the next position using ->serial_nr. CGRP_DEAD is used as the marker to invalidate the next pointer and the only requirement is that the marker is set before the next sibling starts its RCU grace period. Because CGRP_DEAD is set at the end of cgroup_destroy_locked() but the cgroup is unlinked when the reference count reaches zero, we currently have a rather large window where this fallback re-iteration logic can be triggered. This patch introduces CSS_RELEASED which is set when a css is unlinked from its sibling list. This still keeps the re-iteration logic working while drastically reducing the window of its activation. While at it, rewrite the comment in css_next_child() to reflect the new flag and better explain the synchronization. This will also enable iterating csses directly instead of through cgroups. v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by CSS_NO_REF. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:49 -04:00
Tejun Heo	0cb51d71c1	cgroup: move cgroup->serial_nr into cgroup_subsys_state We're moving towards using cgroup_subsys_states as the fundamental structural blocks. All csses including the cgroup->self and actual ones now form trees through css->children and ->sibling which follow the same rules as what cgroup->children and ->sibling followed. This patch moves cgroup->serial_nr which is used to implement css iteration into css. Note that all csses, regardless of their types, allocate their serial numbers from the same monotonically increasing counter. This doesn't affect the ordering needed by css iteration or cause any other material behavior changes. This will be used to update css iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:49 -04:00
Tejun Heo	1fed1b2e36	cgroup: link all cgroup_subsys_states in their sibling lists Currently, while all csses have ->children and ->sibling, only the self csses of cgroups make use of them. This patch makes all other csses to link themselves on the sibling lists too. This will be used to update css iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:49 -04:00
Tejun Heo	d5c419b68e	cgroup: move cgroup->sibling and ->children into cgroup_subsys_state We're moving towards using cgroup_subsys_states as the fundamental structural blocks. Let's move cgroup->sibling and ->children into cgroup_subsys_state. This is pure move without functional change and only cgroup->self's fields are actually used. Other csses will make use of the fields later. While at it, update init_and_link_css() so that it zeroes the whole css before initializing it and remove explicit zeroing of ->flags. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:48 -04:00
Tejun Heo	d51f39b05c	cgroup: remove cgroup->parent cgroup->parent is redundant as cgroup->self.parent can also be used to determine the parent cgroup and we're moving towards using cgroup_subsys_states as the fundamental structural blocks. This patch introduces cgroup_parent() which follows cgroup->self.parent and removes cgroup->parent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:48 -04:00
Tejun Heo	5c9d535b89	cgroup: remove css_parent() cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:48 -04:00
Tejun Heo	3b514d24e2	cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css `9395a45004` ("cgroup: enable refcnting for root csses") enabled reference counting for root csses (cgroup_subsys_states) so that cgroup's self csses can be used to manage the lifetime of the containing cgroups. Unfortunately, this change was incorrect. During early init, cgrp_dfl_root self css refcnt is used. percpu_ref can't initialized during early init and its initialization is deferred till cgroup_init() time. This means that cpu was using percpu_ref which wasn't properly initialized. Due to the way percpu variables are laid out on x86, this didn't blow up immediately on x86 but ended up incrementing and decrementing the percpu variable at offset zero, whatever it may be; however, on other archs, this caused fault and early boot failure. As cgroup self csses for root cgroups of non-dfl hierarchies need working refcounting, we can't revert `9395a45004`. This patch adds CSS_NO_REF which explicitly inhibits reference counting on the css and sets it on all normal (non-self) csses and cgroup_dfl_root self css. v2: cgrp_dfl_root.self is the offending one. Set the flag on it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Warren <swarren@nvidia.com> Tested-by: Stephen Warren <swarren@nvidia.com> Fixes: `9395a45004` ("cgroup: enable refcnting for root csses")	2014-05-16 13:22:47 -04:00
Tejun Heo	9d755d33f0	cgroup: use cgroup->self.refcnt for cgroup refcnting Currently cgroup implements refcnting separately using atomic_t cgroup->refcnt. The destruction paths of cgroup and css are rather complex and bear a lot of similiarities including the use of RCU and bouncing to a work item. This patch makes cgroup use the refcnt of self css for refcnting instead of using its own. This makes cgroup refcnting use css's percpu refcnt and share the destruction mechanism. * css_release_work_fn() and css_free_work_fn() are updated to handle both csses and cgroups. This is a bit messy but should do until we can make cgroup->self a full css, which currently can't be done thanks to multiple hierarchies. * cgroup_destroy_locked() now performs percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp). * Negative refcnt sanity check in cgroup_get() is no longer necessary as percpu_ref already handles it. * Similarly, as a cgroup which hasn't been killed will never be released regardless of its refcnt value and percpu_ref has sanity check on kill, cgroup_is_dead() sanity check in cgroup_put() is no longer necessary. * As whether a refcnt reached zero or not can only be decided after the reference count is killed, cgroup_root->cgrp's refcnting can no longer be used to decide whether to kill the root or not. Let's make cgroup_kill_sb() explicitly initiate destruction if the root doesn't have any children. This makes sense anyway as unmounted cgroup hierarchy without any children should be destroyed. While this is a bit messy, this will allow pushing more bookkeeping towards cgroup->self and thus handling cgroups and csses in more uniform way. In the very long term, it should be possible to introduce a base subsystem and convert the self css to a proper one making things whole lot simpler and unified. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:02 -04:00
Tejun Heo	9395a45004	cgroup: enable refcnting for root csses Currently, css_get(), css_tryget() and css_tryget_online() are noops for root csses as an optimization; however, we're planning to use css refcnts to track of cgroup lifetime too and root cgroups also need to be reference counted. Since css has been converted to percpu_refcnt, the overhead of refcnting is miniscule and this optimization isn't too meaningful anymore. Furthermore, controllers which optimize the root cgroup often never even invoke these functions in their hot paths. This patch enables refcnting for root csses too. This makes CSS_ROOT flag unused and removes it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:02 -04:00
Tejun Heo	25e15d8350	cgroup: bounce css release through css->destroy_work css release is planned to do more and would require process context. Bounce it through css->destroy_work. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:02 -04:00
Tejun Heo	249f3468a2	cgroup: remove cgroup_destory_css_killed() cgroup_destroy_css_killed() is cgroup destruction stage which happens after all csses are offlined. After the recent updates, it no longer does anything other than putting the base reference. This patch removes the function and makes cgroup_destroy_locked() put the base ref at the end isntead. This also makes cgroup->nr_css unnecessary. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	4e4e284723	cgroup: move cgroup->sibling unlinking to cgroup_put() Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to cgroup_put(). This is later but still before the RCU grace period, so it doesn't break css_next_child() although there now is a larger window in which a dead cgroup is visible during css iteration. As css iteration always could have included offline csses, this doesn't affect correctness; however, it does make css_next_child() fall back to reiterting mode more often. This also makes cgroup_put() directly take cgroup_mutex, which limits where it can be called from. These are not immediately problematic and will be dealt with later. This change enables simplification of cgroup destruction path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	9e4173e1f2	cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked() Currently, check_for_release() on the parent of a destroyed cgroup is invoked from cgroup_destroy_css_killed(). This is because this is where the destroyed cgroup can be removed from the parent's children list. check_for_release() tests the emptiness of the list directly, so invoking it before removing the cgroup from the list makes it think that the parent still has children even when it no longer does. This patch updates check_for_release() to use cgroup_has_live_children() instead of directly testing ->children emptiness and moves check_for_release(parent) earlier to the end of cgroup_destroy_locked(). As cgroup_has_live_children() ignores cgroups marked DEAD, check_for_release() functions correctly as long as it's called after asserting DEAD. This makes release notification slightly more timely and more importantly enables further simplification of cgroup destruction path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	cbc125efad	cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked() We're expecting another user. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	9d800df12d	cgroup: rename cgroup->dummy_css to ->self and move it to the top cgroup->dummy_css is used as the placeholder css when performing css oriended operations on the cgroup. We're gonna shift more cgroup management to this css. Let's rename it to ->self and move it to the top. This is pure rename and field relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:00 -04:00
Tejun Heo	a015edd26e	cgroup: use restart_syscall() for mount retries cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root which is being destroyed. The retry currently loops inside cgroup_mount() proper. This patch makes it return with restart_syscall() instead so that retry travels out to userland boundary. This slightly simplifies the logic and more importantly makes the retry logic behave better when the wait for some reason becomes lengthy or infinite by allowing the operation to be suspended or terminated from userland. v2: The original patch forgot to free memory allocated for @opts. Fixed. Caught by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:00 -04:00
Tejun Heo	8353da1f91	cgroup: remove cgroup_tree_mutex cgroup_tree_mutex was introduced to work around the circular dependency between cgroup_mutex and kernfs active protection - some kernfs file and directory operations needed cgroup_mutex putting cgroup_mutex under active protection but cgroup also needs to be able to access cgroup hierarchies and cftypes to determine which kernfs_nodes need to be removed. cgroup_tree_mutex nested above both cgroup_mutex and kernfs active protection and used to protect the hierarchy and cftypes. While this worked, it added a lot of double lockings and was generally cumbersome. kernfs provides a mechanism to opt out of active protection and cgroup was already using it for removal and subtree_control. There's no reason to mix both methods of avoiding circular locking dependency and the preceding cgroup_kn_lock_live() changes applied it to all relevant cgroup kernfs operations making it unnecessary to nest cgroup_mutex under kernfs active protection. The previous patch reversed the original lock ordering and put cgroup_mutex above kernfs active protection. After these changes, all cgroup_tree_mutex usages are now accompanied by cgroup_mutex making the former completely redundant. This patch removes cgroup_tree_mutex and all its usages. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:23 -04:00
Tejun Heo	01f6474ce0	cgroup: nest kernfs active protection under cgroup_mutex After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no longer nested below kernfs active protection. The two don't have any relationship now. This patch nests kernfs active protection under cgroup_mutex. All cftype operations now require both cgroup_tree_mutex and cgroup_mutex, temporary cgroup_mutex releases over kernfs operations are removed, and cgroup_add/rm_cftypes() grab both mutexes. This makes cgroup_tree_mutex redundant, which will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:23 -04:00
Tejun Heo	e76ecaeef6	cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods Make __cgroup_procs_write() and cgroup_release_agent_write() use cgroup_kn_lock_live() and cgroup_kn_unlock() instead of cgroup_lock_live_group(). This puts the operations under both cgroup_tree_mutex and cgroup_mutex protection without circular dependency from kernfs active protection. Also, this means that cgroup_mutex is no longer nested below kernfs active protection. There is no longer any place where the two locks interact. This leaves cgroup_lock_live_group() without any user. Removed. This will help simplifying cgroup locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:23 -04:00
Tejun Heo	a9746d8da7	cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock() cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write() share the logic to break active protection so that they can grab cgroup_tree_mutex which nests above active protection and/or remove self. Factor out this logic into cgroup_kn_lock_live() and cgroup_kn_unlock(). This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00
Tejun Heo	cfc79d5bec	cgroup: move cgroup->kn->priv clearing to cgroup_rmdir() The ->priv field of a cgroup directory kernfs_node points back to the cgroup. This field is RCU cleared in cgroup_destroy_locked() for non-kernfs accesses from css_tryget_from_dir() and cgroupstats_build(). As these are only applicable to cgroups which finished creation successfully and fully initialized cgroups are always removed by cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir(). This will help simplifying cgroup locking and shouldn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00

1 2 3 4 5 ...

711 Commits