linux/include
Tejun Heo 97b27821b4 writeback, memcg: Implement foreign dirty flushing
There's an inherent mismatch between memcg and writeback.  The former
trackes ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads
and deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works
well enough in most cases but there are some pathological cases.  For
example, let's say there are two cgroups A and B which keep writing to
different but confined parts of the same inode.  B owns the inode and
A's memory is limited far below B's.  A's dirty ratio can rise enough
to trigger balance_dirty_pages() sleeps but B's can be low enough to
avoid triggering background writeback.  A will be slowed down without
a way to make writeback of the dirty pages happen.

This patch implements foreign dirty recording and foreign mechanism so
that when a memcg encounters a condition as above it can trigger
flushes on bdi_writebacks which can clean its pages.  Please see the
comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.

A reproducer follows.

write-range.c::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  static const char *usage = "write-range FILE START SIZE\n";

  int main(int argc, char **argv)
  {
	  int fd;
	  unsigned long start, size, end, pos;
	  char *endp;
	  char buf[4096];

	  if (argc < 4) {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  fd = open(argv[1], O_WRONLY);
	  if (fd < 0) {
		  perror("open");
		  return 1;
	  }

	  start = strtoul(argv[2], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  size = strtoul(argv[3], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  end = start + size;

	  while (1) {
		  for (pos = start; pos < end; ) {
			  long bread, bwritten = 0;

			  if (lseek(fd, pos, SEEK_SET) < 0) {
				  perror("lseek");
				  return 1;
			  }

			  bread = read(0, buf, sizeof(buf) < end - pos ?
					       sizeof(buf) : end - pos);
			  if (bread < 0) {
				  perror("read");
				  return 1;
			  }
			  if (bread == 0)
				  return 0;

			  while (bwritten < bread) {
				  long this;

				  this = write(fd, buf + bwritten,
					       bread - bwritten);
				  if (this < 0) {
					  perror("write");
					  return 1;
				  }

				  bwritten += this;
				  pos += bwritten;
			  }
		  }
	  }
  }

repro.sh::

  #!/bin/bash

  set -e
  set -x

  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000
  echo 3 > /proc/sys/vm/drop_caches

  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B

  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high

  rm -f testfile
  touch testfile
  fallocate -l 4G testfile

  echo "Starting B"

  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync
  echo "Starting A"

  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))

v2: Added comments explaining why the specific intervals are being used.

v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
    flushing while avoding possible livelocks.

v4: Use get_jiffies_64() and time_before/after64() instead of raw
    jiffies_64 and arthimetic comparisons as suggested by Jan.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 09:22:38 -06:00
..
acpi It's been a relatively busy cycle for docs: 2019-07-09 12:34:26 -07:00
asm-generic asm-generic: fix -Wtype-limits compiler warnings 2019-08-03 07:02:01 -07:00
clocksource clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic 2019-07-03 11:00:59 +02:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2019-07-08 20:57:08 -07:00
drm drm/fb-helper: Instanciate shadow FB if configured in device's mode_config 2019-08-01 15:01:35 +02:00
dt-bindings ARM: Device-tree updates 2019-07-19 17:19:24 -07:00
keys request_key improvements 2019-07-08 19:19:37 -07:00
kvm KVM: arm/arm64: Support chained PMU counters 2019-07-05 13:56:22 +01:00
linux writeback, memcg: Implement foreign dirty flushing 2019-08-27 09:22:38 -06:00
math-emu
media media updates for v5.3-rc1 2019-07-09 09:47:22 -07:00
misc powerpc updates for 5.3 2019-07-13 16:08:36 -07:00
net tcp: be more careful in tcp_fragment() 2019-07-21 20:41:24 -07:00
pcmcia It's been a relatively busy cycle for docs: 2019-07-09 12:34:26 -07:00
ras
rdma RDMA/devices: Remove the lock around remove_client_context 2019-08-01 11:44:48 -04:00
scsi scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure 2019-07-29 21:12:35 -04:00
soc Merge branch 'pdf_fixes_v1' of https://git.linuxtv.org/mchehab/experimental into mauro 2019-07-22 13:51:20 -06:00
sound SPDX fixes for 5.3-rc2 2019-07-28 10:00:06 -07:00
target
trace tracing: Fix header include guards in trace event headers 2019-07-30 21:49:06 -04:00
uapi virtio, vhost: bugfixes 2019-07-29 11:34:12 -07:00
vdso vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h 2019-06-26 07:28:09 +02:00
video drm main pull request for v5.3-rc1 (sans mm changes) 2019-07-15 19:04:27 -07:00
xen xen: avoid link error on ARM 2019-07-31 08:14:12 +02:00
Kbuild kbuild: add net/netfilter/nf_tables_offload.h to header-test blacklist. 2019-07-21 11:43:43 -07:00