Commit Graph

753091 Commits

Author SHA1 Message Date
Coly Li ecb2ba8cb8 bcache: add wait_for_kthread_stop() in bch_allocator_thread()
When CACHE_SET_IO_DISABLE is set on cache set flags, bcache allocator
thread routine bch_allocator_thread() may stop the while-loops and
exit. Then it is possible to observe the following kernel oops message,

[  631.068366] bcache: bch_btree_insert() error -5
[  631.069115] bcache: cached_dev_detach_finish() Caching disabled for sdf
[  631.070220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  631.070250] PGD 0 P4D 0
[  631.070261] Oops: 0002 [#1] SMP PTI
[snipped]
[  631.070578] Workqueue: events cache_set_flush [bcache]
[  631.070597] RIP: 0010:exit_creds+0x1b/0x50
[  631.070610] RSP: 0018:ffffc9000705fe08 EFLAGS: 00010246
[  631.070626] RAX: 0000000000000001 RBX: ffff880a622ad300 RCX: 000000000000000b
[  631.070645] RDX: 0000000000000601 RSI: 000000000000000c RDI: 0000000000000000
[  631.070663] RBP: ffff880a622ad300 R08: ffffea00190c66e0 R09: 0000000000000200
[  631.070682] R10: ffff880a48123000 R11: ffff880000000000 R12: 0000000000000000
[  631.070700] R13: ffff880a4b160e40 R14: ffff880a4b160000 R15: 0ffff880667e2530
[  631.070719] FS:  0000000000000000(0000) GS:ffff880667e00000(0000) knlGS:0000000000000000
[  631.070740] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  631.070755] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000003606e0
[  631.070774] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  631.070793] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  631.070811] Call Trace:
[  631.070828]  __put_task_struct+0x55/0x160
[  631.070845]  kthread_stop+0xee/0x100
[  631.070863]  cache_set_flush+0x11d/0x1a0 [bcache]
[  631.070879]  process_one_work+0x146/0x340
[  631.070892]  worker_thread+0x47/0x3e0
[  631.070906]  kthread+0xf5/0x130
[  631.070917]  ? max_active_store+0x60/0x60
[  631.070930]  ? kthread_bind+0x10/0x10
[  631.070945]  ret_from_fork+0x35/0x40
[snipped]
[  631.071017] RIP: exit_creds+0x1b/0x50 RSP: ffffc9000705fe08
[  631.071033] CR2: 0000000000000000
[  631.071045] ---[ end trace 011c63a24b22c927 ]---
[  631.071085] bcache: bcache_device_free() bcache0 stopped

The reason is when cache_set_flush() tries to call kthread_stop() to stop
allocator thread, but it exits already due to cache device I/O errors.

This patch adds wait_for_kthread_stop() at tail of bch_allocator_thread(),
to prevent the thread routine exiting directly. Then the allocator thread
can be blocked at wait_for_kthread_stop() and wait for cache_set_flush()
to stop it by calling kthread_stop().

changelog:
v3: add Reviewed-by from Hannnes.
v2: not directly return from allocator_wait(), move 'return 0' to tail of
    bch_allocator_thread().
v1: initial version.

Fixes: 771f393e8f ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:13 -06:00
Coly Li bf78980fcc bcache: count backing device I/O error for writeback I/O
Commit c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
counts backing device I/O requets and set dc->io_disable to true if error
counters exceeds dc->io_error_limit. But it only counts I/O errors for
regular I/O request, neglects errors of write back I/Os when backing device
is offline.

This patch counts the errors of writeback I/Os, in dirty_endio() if
bio->bi_status is  not 0, it means error happens when writing dirty keys
to backing device, then bch_count_backing_io_errors() is called.

By this fix, even there is no reqular I/O request coming, if writeback I/O
errors exceed dc->io_error_limit, the bcache device may still be stopped
for the broken backing device.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:12 -06:00
Coly Li 6147305c73 bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()
Commit c7b7bd0740 ("bcache: add io_disable to struct cached_dev") tries
to stop bcache device by calling bcache_device_stop() when too many I/O
errors happened on backing device. But if there is internal I/O happening
on cache device (writeback scan, garbage collection, etc), a regular I/O
request triggers the internal I/Os may still holds a refcount of dc->count,
and the refcount may only be dropped after the internal I/O stopped.

By this patch, bch_cached_dev_error() will check if the backing device is
attached to a cache set, if yes that CACHE_SET_IO_DISABLE will be set to
flags of this cache set. Then internal I/Os on cache device will be
rejected and stopped immediately, and the bcache device can be stopped.

For people who are not familiar with the interesting refcount dependance,
let me explain a bit more how the fix works. Example the writeback thread
will scan cache device for dirty data writeback purpose. Before it stopps,
it holds a refcount of dc->count. When CACHE_SET_IO_DISABLE bit is set,
the internal I/O will stopped and the while-loop in bch_writeback_thread()
quits and calls cached_dev_put() to drop dc->count. If this is the last
refcount to drop, then cached_dev_detach_finish() will be called. In this
call back function, in turn closure_put(dc->disk.cl) is called to drop a
refcount of closure dc->disk.cl. If this is the last refcount of this
closure to drop, then cached_dev_flush() will be called. Then the cached
device is freed. So if CACHE_SET_IO_DISABLE is not set, the bache device
can not be stopped until all inernal cache device I/O stopped. For large
size cache device, and writeback thread competes locks with gc thread,
there might be a quite long time to wait.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:10 -06:00
Coly Li 6e916a7eb1 bcache: store disk name in struct cache and struct cached_dev
Current code uses bdevname() or bio_devname() to reference gendisk
disk name when bcache needs to display the disk names in kernel message.
It was safe before bcache device failure handling patch set merged in,
because when devices are failed, there was deadlock to prevent bcache
printing error messages with gendisk disk name. But after the failure
handling patch set merged, the deadlock is fixed, so it is possible
that the gendisk structure bdev->hd_disk is released when bdevname() is
called to reference bdev->bd_disk->disk_name[]. This is why I receive
bug report of NULL pointers deference panic.

This patch stores gendisk disk name in a buffer inside struct cache and
struct cached_dev, then print out the offline device name won't reference
bdev->hd_disk anymore. And this patch also avoids extra function calls
of bdevname() and bio_devnmae().

Changelog:
v3, add Reviewed-by from Hannes.
v2, call bdevname() earlier in register_bdev()
v1, first version with segguestion from Junhui Tang.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Fixes: 5138ac6748 ("bcache: fix misleading error message in bch_count_io_errors()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:08 -06:00
Joerg Roedel a85894cd77 iommu/vt-d: Use WARN_ON_ONCE instead of BUG_ON in qi_flush_dev_iotlb()
A misaligned address is only worth a warning, and not
stopping the while execution path with a BUG_ON().

Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-05-03 16:32:11 +02:00
Changbin Du 0dfc0c792d iommu/vt-d: fix shift-out-of-bounds in bug checking
It allows to flush more than 4GB of device TLBs. So the mask should be
64bit wide. UBSAN captured this fault as below.

[    3.760024] ================================================================================
[    3.768440] UBSAN: Undefined behaviour in drivers/iommu/dmar.c:1348:3
[    3.774864] shift exponent 64 is too large for 32-bit type 'int'
[    3.780853] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G     U            4.17.0-rc1+ #89
[    3.788661] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[    3.796034] Call Trace:
[    3.798472]  <IRQ>
[    3.800479]  dump_stack+0x90/0xfb
[    3.803787]  ubsan_epilogue+0x9/0x40
[    3.807353]  __ubsan_handle_shift_out_of_bounds+0x10e/0x170
[    3.812916]  ? qi_flush_dev_iotlb+0x124/0x180
[    3.817261]  qi_flush_dev_iotlb+0x124/0x180
[    3.821437]  iommu_flush_dev_iotlb+0x94/0xf0
[    3.825698]  iommu_flush_iova+0x10b/0x1c0
[    3.829699]  ? fq_ring_free+0x1d0/0x1d0
[    3.833527]  iova_domain_flush+0x25/0x40
[    3.837448]  fq_flush_timeout+0x55/0x160
[    3.841368]  ? fq_ring_free+0x1d0/0x1d0
[    3.845200]  ? fq_ring_free+0x1d0/0x1d0
[    3.849034]  call_timer_fn+0xbe/0x310
[    3.852696]  ? fq_ring_free+0x1d0/0x1d0
[    3.856530]  run_timer_softirq+0x223/0x6e0
[    3.860625]  ? sched_clock+0x5/0x10
[    3.864108]  ? sched_clock+0x5/0x10
[    3.867594]  __do_softirq+0x1b5/0x6f5
[    3.871250]  irq_exit+0xd4/0x130
[    3.874470]  smp_apic_timer_interrupt+0xb8/0x2f0
[    3.879075]  apic_timer_interrupt+0xf/0x20
[    3.883159]  </IRQ>
[    3.885255] RIP: 0010:poll_idle+0x60/0xe7
[    3.889252] RSP: 0018:ffffb1b201943e30 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[    3.896802] RAX: 0000000080200000 RBX: 000000000000008e RCX: 000000000000001f
[    3.903918] RDX: 0000000000000000 RSI: 000000002819aa06 RDI: 0000000000000000
[    3.911031] RBP: ffff9e93c6b33280 R08: 00000010f717d567 R09: 000000000010d205
[    3.918146] R10: ffffb1b201943df8 R11: 0000000000000001 R12: 00000000e01b169d
[    3.925260] R13: 0000000000000000 R14: ffffffffb12aa400 R15: 0000000000000000
[    3.932382]  cpuidle_enter_state+0xb4/0x470
[    3.936558]  do_idle+0x222/0x310
[    3.939779]  cpu_startup_entry+0x78/0x90
[    3.943693]  start_secondary+0x205/0x2e0
[    3.947607]  secondary_startup_64+0xa5/0xb0
[    3.951783] ================================================================================

Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-05-03 16:32:11 +02:00
Shameer Kolothum cd2c9fcf5c iommu/dma: Move PCI window region reservation back into dma specific path.
This pretty much reverts commit 273df96353 ("iommu/dma: Make PCI
window reservation generic")  by moving the PCI window region
reservation back into the dma specific path so that these regions
doesn't get exposed via the IOMMU API interface. With this change,
the vfio interface will report only iommu specific reserved regions
to the user space.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Fixes: 273df96353 ('iommu/dma: Make PCI window reservation generic')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-05-03 16:32:10 +02:00
Heiko Stuebner 2f8c7f2e76 iommu/rockchip: Make clock handling optional
iommu clocks are optional, so the driver should not fail if they are not
present. Instead just set the number of clocks to 0, which the clk-blk APIs
can handle just fine.

Fixes: f2e3a5f557 ("iommu/rockchip: Control clocks needed to access the IOMMU")
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-05-03 16:32:10 +02:00
Arnd Bergmann 94c793acca iommu/amd: Hide unused iommu_table_lock
The newly introduced lock is only used when CONFIG_IRQ_REMAP is enabled:

drivers/iommu/amd_iommu.c:86:24: error: 'iommu_table_lock' defined but not used [-Werror=unused-variable]
 static DEFINE_SPINLOCK(iommu_table_lock);

This moves the definition next to the user, within the #ifdef protected
section of the file.

Fixes: ea6166f4b8 ("iommu/amd: Split irq_lookup_table out of the amd_iommu_devtable_lock")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-05-03 16:31:57 +02:00
Jagannathan Raman aa7528fe35 iommu/vt-d: Fix usage of force parameter in intel_ir_reconfigure_irte()
It was noticed that the IRTE configured for guest OS kernel
was over-written while the guest was running. As a result,
vt-d Posted Interrupts configured for the guest are not being
delivered directly, and instead bounces off the host. Every
interrupt delivery takes a VM Exit.

It was noticed that the following stack is doing the over-write:
[  147.463177]  modify_irte+0x171/0x1f0
[  147.463405]  intel_ir_set_affinity+0x5c/0x80
[  147.463641]  msi_domain_set_affinity+0x32/0x90
[  147.463881]  irq_do_set_affinity+0x37/0xd0
[  147.464125]  irq_set_affinity_locked+0x9d/0xb0
[  147.464374]  __irq_set_affinity+0x42/0x70
[  147.464627]  write_irq_affinity.isra.5+0xe1/0x110
[  147.464895]  proc_reg_write+0x38/0x70
[  147.465150]  __vfs_write+0x36/0x180
[  147.465408]  ? handle_mm_fault+0xdf/0x200
[  147.465671]  ? _cond_resched+0x15/0x30
[  147.465936]  vfs_write+0xad/0x1a0
[  147.466204]  SyS_write+0x52/0xc0
[  147.466472]  do_syscall_64+0x74/0x1a0
[  147.466744]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

reversing the sense of force check in intel_ir_reconfigure_irte()
restores proper posted interrupt functionality

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Fixes: d491bdff88 ('iommu/vt-d: Reevaluate vector configuration on activate()')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-05-03 16:31:39 +02:00
Linus Torvalds f4ef6a438c Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
 
  - Use special field for function pointer in trace event
 
  - Fix igrab issues in uprobes
 
  - Fixes to the new histogram triggers
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuoYdBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtFnAP9X4+AVDQH0VfsMLSc9D+rK6WmcRIhv
 q8J2gNPv3anM+AD/SFXWGO4ihN+0KDw/TqmJxESNEybq47vTZ/s5lM6A4gQ=
 =fQbj
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes in tracing:

   - Tracepoints should not give warning on OOM failures

   - Use special field for function pointer in trace event

   - Fix igrab issues in uprobes

   - Fixes to the new histogram triggers"

* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Do not warn on ENOMEM
  tracing: Add field modifier parsing hist error for hist triggers
  tracing: Add field parsing hist error for hist triggers
  tracing: Restore proper field flag printing when displaying triggers
  tracing: initcall: Ordered comparison of function pointers
  tracing: Remove igrab() iput() call from uprobes.c
  tracing: Fix bad use of igrab in trace_uprobe.c
2018-05-02 17:38:37 -10:00
Linus Torvalds ecd649b340 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
 "Just a few driver fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: atmel_mxt_ts - add missing compatible strings to OF device table
  Input: atmel_mxt_ts - fix the firmware update
  Input: atmel_mxt_ts - add touchpad button mapping for Samsung Chromebook Pro
  MAINTAINERS: Rakesh Iyer can't be reached anymore
  Input: hideep_ts - fix a typo in Kconfig
  Input: alps - fix reporting pressure of v3 trackstick
  Input: leds - fix out of bound access
  Input: synaptics-rmi4 - fix an unchecked out of memory error path
2018-05-02 17:34:42 -10:00
Julian Anastasov 94720e3aee ipv4: fix fnhe usage by non-cached routes
Allow some non-cached routes to use non-expired fnhe:

1. ip_del_fnhe: moved above and now called by find_exception.
The 4.5+ commit deed49df73 expires fnhe only when caching
routes. Change that to:

1.1. use fnhe for non-cached local output routes, with the help
from (2)

1.2. allow __mkroute_input to detect expired fnhe (outdated
fnhe_gw, for example) when do_cache is false, eg. when itag!=0
for unicast destinations.

2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
to use fnhe info even when the new route will not be cached into fnhe.
After commit 839da4d989 ("net: ipv4: set orig_oif based on fib
result for local traffic") it means all local routes will be affected
because they are not cached. This change is used to solve a PMTU
problem with IPVS (and probably Netfilter DNAT) setups that redirect
local clients from target local IP (local route to Virtual IP)
to new remote IP target, eg. IPVS TUN real server. Loopback has
64K MTU and we need to create fnhe on the local route that will
keep the reduced PMTU for the Virtual IP. Without this change
fnhe_pmtu is updated from ICMP but never exposed to non-cached
local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
with flowi4_oif=any for 4.14+).

3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
new entries

Fixes: 839da4d989 ("net: ipv4: set orig_oif based on fib result for local traffic")
Fixes: d6d5e999e5 ("route: do not cache fib route info on local routes with oif")
Fixes: deed49df73 ("route: check and remove route cache when we get route")
Cc: David Ahern <dsahern@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 22:52:35 -04:00
Linus Torvalds 3b6f979319 SCSI fixes on 20180502
Three small bug fixes: an illegally overlapping memcmp in target code,
 a potential infinite loop in isci under certain rare phy conditions
 and an ATA queue depth (performance) correction for storvsc.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWunNPiYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishdyJAQDGWGGN
 i5BJOr1c8BNEMHHuJIYvpgNJDLQd5tfqxieZgAD/e1RQn1nIjxnoIrduOAP+so8u
 XvGtP79yTO2yxz7VP48=
 =HASg
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Three small bug fixes: an illegally overlapping memcmp in target code,
  a potential infinite loop in isci under certain rare phy conditions
  and an ATA queue depth (performance) correction for storvsc"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: target: Fix fortify_panic kernel exception
  scsi: isci: Fix infinite loop in while loop
  scsi: storvsc: Set up correct queue depth values for IDE devices
2018-05-02 16:38:17 -10:00
Dave Airlie 1e5fbc0b8d vc4: Fix bo refcounts during async commits (Boris)
vga-dac: Fix edid memory leak (Sean)
 
 Cc: Boris Brezillon <boris.brezillon@bootlin.com>
 Cc: Sean Paul <seanpaul@chromium.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEfxcpfMSgdnQMs+QqlvcN/ahKBwoFAlrqFqcACgkQlvcN/ahK
 BwraMwf+PppfIRNcyvUU3iPu6WiaodD9qOdt55AU6WK8SmEJ0Bk+10XNFjYHXOrA
 b/zUo6g7Zru+sneLKhZGwpGkBQ1IcuwichpQGd+OvknT2U8Ee8IwEZib7g9Vpqoi
 np4PWKMRI1/XWlCOHvkY6fAV7uVDWjkRKDmeUmpTRaWBQdq/QVAzk1z0hUW93mSY
 RqZZTNZBXPsuwW+wIdAIDlm5PSvYBbgEa2HVj44iK4PB7SlI5jGMtjbQpcbmmT20
 V4RvvLPeRBCizA5LiSu+HspRCvtBZfSHt5nRqONc11bmK0DjiZt8tBSsbHEUHEt6
 ySDjbjk2RVwVAE5wV3Z7xKT1CQOBZg==
 =yyCu
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2018-05-02' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

vc4: Fix bo refcounts during async commits (Boris)
vga-dac: Fix edid memory leak (Sean)

Cc: Boris Brezillon <boris.brezillon@bootlin.com>
Cc: Sean Paul <seanpaul@chromium.org>

* tag 'drm-misc-fixes-2018-05-02' of git://anongit.freedesktop.org/drm/drm-misc:
  drm/bridge: vga-dac: Fix edid memory leak
  drm/vc4: Make sure vc4_bo_{inc,dec}_usecnt() calls are balanced
2018-05-03 11:58:39 +10:00
Dave Airlie 083faae152 Merge tag 'drm-intel-fixes-2018-05-02' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
Add DMC firmware for Geminilake.

* tag 'drm-intel-fixes-2018-05-02' of git://anongit.freedesktop.org/drm/drm-intel:
  drm/i915/glk: Add MODULE_FIRMWARE for Geminilake
2018-05-03 11:58:19 +10:00
David S. Miller e002434e88 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-05-03

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several BPF sockmap fixes mostly related to bugs in error path
   handling, that is, a bug in updating the scatterlist length /
   offset accounting, a missing sk_mem_uncharge() in redirect
   error handling, and a bug where the outstanding bytes counter
   sg_size was not zeroed, from John.

2) Fix two memory leaks in the x86-64 BPF JIT, one in an error
   path where we still don't converge after image was allocated
   and another one where BPF calls are used and JIT passes don't
   converge, from Daniel.

3) Minor fix in BPF selftests where in test_stacktrace_build_id()
   we drop useless args in urandom_read and we need to add a missing
   newline in a CHECK() error message, from Song.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 20:42:44 -04:00
Alexei Starovoitov b5b6ff7302 Merge branch 'bpf-sockmap-fixes'
John Fastabend says:

====================
When I added the test_sockmap to selftests I mistakenly changed the
test logic a bit. The result of this was on redirect cases we ended up
choosing the wrong sock from the BPF program and ended up sending to a
socket that had no receive handler. The result was the actual receive
handler, running on a different socket, is timing out and closing the
socket. This results in errors (-EPIPE to be specific) on the sending
side. Typically happening if the sender does not complete the send
before the receive side times out. So depending on timing and the size
of the send we may get errors. This exposed some bugs in the sockmap
error path handling.

This series fixes the errors. The primary issue is we did not do proper
memory accounting in these cases which resulted in missing a
sk_mem_uncharge(). This happened in the redirect path and in one case
on the normal send path. See the three patches for the details.

The other take-away from this is we need to fix the test_sockmap and
also add more negative test cases. That will happen in bpf-next.

Finally, I tested this using the existing test_sockmap program, the
older sockmap sample test script, and a few real use cases with
Cilium. All of these seem to be in working correctly.

v2: fix compiler warning, drop iterator variable 'i' that is no longer
    used in patch 3.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:46 -07:00
John Fastabend abaeb096ca bpf: sockmap, fix error handling in redirect failures
When a redirect failure happens we release the buffers in-flight
without calling a sk_mem_uncharge(), the uncharge is called before
dropping the sock lock for the redirecte, however we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But, when we have apply logic running its possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.

With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.

The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:45 -07:00
John Fastabend fec51d40ea bpf: sockmap, zero sg_size on error when buffer is released
When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.

In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simple consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.

In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally to keep the psock state in-sync with the sk_msg_buff
release all the memory as well. Previously we did this before
returning to the user but this left a gap where psock and sk_msg_buff
states were out of sync which seems fragile. No additional overhead
is taken here except for a call to check the length and realize its
already been freed. This is in the error path as well so in my
opinion lets have robust code over optimized error paths.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:45 -07:00
John Fastabend 3cc9a472d6 bpf: sockmap, fix scatterlist update on error path in send with apply
When the call to do_tcp_sendpage() fails to send the complete block
requested we either retry if only a partial send was completed or
abort if we receive a error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.

Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is tore down because the bytes have previously been charged to
the socket but are never uncharged.

The simple fix is to simply do the accounting inside the retry loop
subtracting from the absolute scatterlist values rather than trying
to accumulate the totals and subtract at the end.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:44 -07:00
Eric Dumazet 7df40c2673 net_sched: fq: take care of throttled flows before reuse
Normally, a socket can not be freed/reused unless all its TX packets
left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
this to happen.

With commit fc59d5bdf1 ("pkt_sched: fq: clear time_next_packet for
reused flows") we cleared f->time_next_packet but took no special
action if the flow was still in the throttled rb-tree.

Since f->time_next_packet is the key used in the rb-tree searches,
blindly clearing it might break rb-tree integrity. We need to make
sure the flow is no longer in the rb-tree to avoid this problem.

Fixes: fc59d5bdf1 ("pkt_sched: fq: clear time_next_packet for reused flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 16:37:38 -04:00
Ido Schimmel 30ca22e4a5 ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"
This reverts commit edd7ceb782 ("ipv6: Allow non-gateway ECMP for
IPv6").

Eric reported a division by zero in rt6_multipath_rebalance() which is
caused by above commit that considers identical local routes to be
siblings. The division by zero happens because a nexthop weight is not
set for local routes.

Revert the commit as it does not fix a bug and has side effects.

To reproduce:

# ip -6 address add 2001:db8::1/64 dev dummy0
# ip -6 address add 2001:db8::1/64 dev dummy1

Fixes: edd7ceb782 ("ipv6: Allow non-gateway ECMP for IPv6")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 16:33:02 -04:00
Helge Deller 8d73b18079 parisc: Fix section mismatches
Fix three section mismatches:
1) Section mismatch in reference from the function ioread8() to the
   function .init.text:pcibios_init_bridge()
2) Section mismatch in reference from the function free_initmem() to the
   function .init.text:map_pages()
3) Section mismatch in reference from the function ccio_ioc_init() to
   the function .init.text:count_parisc_driver()

Signed-off-by: Helge Deller <deller@gmx.de>
2018-05-02 21:47:35 +02:00
Helge Deller b819439fea parisc: drivers.c: Fix section mismatches
Fix two section mismatches in drivers.c:
1) Section mismatch in reference from the function alloc_tree_node() to
   the function .init.text:create_tree_node().
2) Section mismatch in reference from the function walk_native_bus() to
   the function .init.text:alloc_pa_dev().

Signed-off-by: Helge Deller <deller@gmx.de>
2018-05-02 21:47:27 +02:00
Alexei Starovoitov 0f58e58e28 Merge branch 'x86-bpf-jit-fixes'
Daniel Borkmann says:

====================
Fix two memory leaks in x86 JIT. For details, please see
individual patches in this series. Thanks!
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 12:35:48 -07:00
Daniel Borkmann 39f56ca945 bpf, x64: fix memleak when not converging on calls
The JIT logic in jit_subprogs() is as follows: for all subprogs we
allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here),
and pass it to bpf_int_jit_compile(). If a failure occurred during
JIT and prog->jited is not set, then we bail out from attempting to
JIT the whole program, and punt to the interpreter instead. In case
JITing went successful, we fixup BPF call offsets and do another
pass to bpf_int_jit_compile() (extra_pass is true at that point) to
complete JITing calls. Given that requires to pass JIT context around
addrs and jit_data from x86 JIT are freed in the extra_pass in
bpf_int_jit_compile() when calls are involved (if not, they can
be freed immediately). However, if in the original pass, the JIT
image didn't converge then we leak addrs and jit_data since image
itself is NULL, the prog->is_func is set and extra_pass is false
in that case, meaning both will become unreachable and are never
cleaned up, therefore we need to free as well on !image. Only x64
JIT is affected.

Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 12:35:47 -07:00
Daniel Borkmann 3aab8884c9 bpf, x64: fix memleak when not converging after image
While reviewing x64 JIT code, I noticed that we leak the prior allocated
JIT image in the case where proglen != oldproglen during the JIT passes.
Prior to the commit e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT
compiler") we would just break out of the loop, and using the image as the
JITed prog since it could only shrink in size anyway. After e0ee9c1215,
we would bail out to out_addrs label where we free addrs and jit_data but
not the image coming from bpf_jit_binary_alloc().

Fixes: e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 12:35:47 -07:00
Sean Paul 49ceda9de2 drm/bridge: vga-dac: Fix edid memory leak
edid should be freed once it's finished being used.

Fixes: 56fe8b6f49 ("drm/bridge: Add RGB to VGA bridge support")
Cc: Rob Herring <robh@kernel.org>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Archit Taneja <architt@codeaurora.org>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Laurent Pinchart <Laurent.pinchart@ideasonboard.com>
Cc: <stable@vger.kernel.org> # v4.9+
Reviewed-by: Maxime Ripard <maxime.ripard@bootlin.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Sean Paul <seanpaul@chromium.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20180420190007.1572-1-seanpaul@chromium.org
2018-05-02 15:09:21 -04:00
Ursula Braun 784813aed6 net/smc: restrict non-blocking connect finish
The smc_poll code tries to finish connect() if the socket is in
state SMC_INIT and polling of the internal CLC-socket returns with
EPOLLOUT. This makes sense for a select/poll call following a connect
call, but not without preceding connect().
With this patch smc_poll starts connect logic only, if the CLC-socket
is no longer in its initial state TCP_CLOSE.

In addition, a poll error on the internal CLC-socket is always
propagated to the SMC socket.

With this patch the code path mentioned by syzbot
https://syzkaller.appspot.com/bug?extid=03faa2dc16b8b64be396
is no longer possible.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Reported-by: syzbot+03faa2dc16b8b64be396@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 13:27:19 -04:00
Ingo Molnar af3e0fcf78 8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
Use disable_irq_nosync() instead of disable_irq() as this might be
called in atomic context with netpoll.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 13:22:06 -04:00
Darrick J. Wong 021ba8e98f xfs: cap the length of deduplication requests
Since deduplication potentially has to read in all the pages in both
files in order to compare the contents, cap the deduplication request
length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound
on the request length and can't just lock up the kernel forever.  Found
by running generic/304 after commit 1ddae54555b62 ("common/rc: add
missing 'local' keywords").

Reported-by: matorola@gmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-05-02 09:21:33 -07:00
Rasmus Villemoes 9fc347678d modpost: delete stale comment
Commit 7840fea200 ("kbuild: Fix computing srcversion for modules")
fixed the comment above parse_source_files to refer to the new source_
line, but left this one behind that could still give the impression that
drivers/net/dummy.c appears in the deps_ variable.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-03 01:17:44 +09:00
Xin Long ce402f044e sctp: fix the issue that the cookie-ack with auth can't get processed
When auth is enabled for cookie-ack chunk, in sctp_inq_pop, sctp
processes auth chunk first, then continues to the next chunk in
this packet if chunk_end + chunk_hdr size < skb_tail_pointer().
Otherwise, it will go to the next packet or discard this chunk.

However, it missed the fact that cookie-ack chunk's size is equal
to chunk_hdr size, which couldn't match that check, and thus this
chunk would not get processed.

This patch fixes it by changing the check to chunk_end + chunk_hdr
size <= skb_tail_pointer().

Fixes: 26b87c7881 ("net: sctp: fix remote memory pressure from excessive queueing")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 11:15:33 -04:00
Xin Long 46e16d4b95 sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
When processing a duplicate cookie-echo chunk, for case 'D', sctp will
not process the param from this chunk. It means old asoc has nothing
to be updated, and the new temp asoc doesn't have the complete info.

So there's no reason to use the new asoc when creating the cookie-ack
chunk. Otherwise, like when auth is enabled for cookie-ack, the chunk
can not be set with auth, and it will definitely be dropped by peer.

This issue is there since very beginning, and we fix it by using the
old asoc instead.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 11:15:33 -04:00
Xin Long 4842a08fb8 sctp: init active key for the new asoc in dupcook_a and dupcook_b
When processing a duplicate cookie-echo chunk, for case 'A' and 'B',
after sctp_process_init for the new asoc, if auth is enabled for the
cookie-ack chunk, the active key should also be initialized.

Otherwise, the cookie-ack chunk made later can not be set with auth
shkey properly, and a crash can even be caused by this, as after
Commit 1b1e0bc994 ("sctp: add refcnt support for sh_key"), sctp
needs to hold the shkey when making control chunks.

Fixes: 1b1e0bc994 ("sctp: add refcnt support for sh_key")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 11:15:32 -04:00
Neal Cardwell e6e6a278b1 tcp_bbr: fix to zero idle_restart only upon S/ACKed data
Previously the bbr->idle_restart tracking was zeroing out the
bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
e.g. receiving incoming data or receiver window updates. In such
situations BBR would forget that this was a restart-from-idle
situation, and if the min_rtt had expired it would unnecessarily enter
PROBE_RTT (even though we were actually restarting from idle but had
merely forgotten that fact).

The fix is simple: we need to remember we are restarting from idle
until we receive a S/ACK for some data (a S/ACK for the first flight
of data we send as we are restarting).

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 11:12:32 -04:00
Grygorii Strashko 5e5add172e net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode
In dual_mac mode packets arrived on one port should not be forwarded by
switch hw to another port. Only Linux Host can forward packets between
ports. The below test case (reported in [1]) shows that packet arrived on
one port can be leaked to anoter (reproducible with dual port evms):
 - connect port 1 (eth0) to linux Host 0 and run tcpdump or Wireshark
 - connect port 2 (eth1) to linux Host 1 with vlan 1 configured
 - ping <IPx> from Host 1 through vlan 1 interface.
ARP packets will be seen on Host 0.

Issue happens because dual_mac mode is implemnted using two vlans: 1 (Port
1+Port 0) and 2 (Port 2+Port 0), so there are vlan records created for for
each vlan. By default, the ALE will find valid vlan record in its table
when vlan 1 tagged packet arrived on Port 2 and so forwards packet to all
ports which are vlan 1 members (like Port.

To avoid such behaviorr the ALE VLAN ID Ingress Check need to be enabled
for each external CPSW port (ALE_PORTCTLn.VID_INGRESS_CHECK) so ALE will
drop ingress packets if Rx port is not VLAN member.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 11:08:23 -04:00
Thomas Gleixner c65732e4f7 x86/cpu: Restore CPUID_8000_0008_EBX reload
The recent commt which addresses the x86_phys_bits corruption with
encrypted memory on CPUID reload after a microcode update lost the reload
of CPUID_8000_0008_EBX as well.

As a consequence IBRS and IBRS_FW are not longer detected

Restore the behaviour by bringing the reload of CPUID_8000_0008_EBX
back. This restore has a twist due to the convoluted way the cpuid analysis
works:

CPUID_8000_0008_EBX is used by AMD to enumerate IBRB, IBRS, STIBP. On Intel
EBX is not used. But the speculation control code sets the AMD bits when
running on Intel depending on the Intel specific speculation control
bits. This was done to use the same bits for alternatives.

The change which moved the 8000_0008 evaluation out of get_cpu_cap() broke
this nasty scheme due to ordering. So that on Intel the store to
CPUID_8000_0008_EBX clears the IBRB, IBRS, STIBP bits which had been set
before by software.

So the actual CPUID_8000_0008_EBX needs to go back to the place where it
was and the phys/virt address space calculation cannot touch it.

In hindsight this should have used completely synthetic bits for IBRB,
IBRS, STIBP instead of reusing the AMD bits, but that's for 4.18.

/me needs to find time to cleanup that steaming pile of ...

Fixes: d94a155c59 ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits adjustment corruption")
Reported-by: Jörg Otte <jrg.otte@gmail.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jörg Otte <jrg.otte@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kirill.shutemov@linux.intel.com
Cc: Borislav Petkov <bp@alien8.de
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1805021043510.1668@nanos.tec.linutronix.de
2018-05-02 16:44:38 +02:00
Michael S. Tsirkin c818aa88d2 Revert "vhost: make msg padding explicit"
This reverts commit 93c0d549c4c5a7382ad70de6b86610b7aae57406.

Unfortunately the padding will break 32 bit userspace.
Ouch. Need to add some compat code, revert for now.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 10:27:18 -04:00
Peter Zijlstra 7dba33c634 clocksource: Rework stale comment
AFAICS the hotplug code no longer uses this function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Link: https://lkml.kernel.org/r/20180430100344.656525644@infradead.org
2018-05-02 16:10:41 +02:00
Peter Zijlstra cd2af07d82 clocksource: Consistent de-rate when marking unstable
When a registered clocksource gets marked unstable the watchdog_kthread
will de-rate and re-select the clocksource. Ensure it also de-rates
when getting called on an unregistered clocksource.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180430100344.594904898@infradead.org
2018-05-02 16:10:41 +02:00
Peter Zijlstra e3b4f79025 x86/tsc: Fix mark_tsc_unstable()
mark_tsc_unstable() also needs to affect tsc_early, Now that
clocksource_mark_unstable() can be used on a clocksource irrespective of
its registration state, use it on both tsc_early and tsc.

This does however require cs->list to be initialized empty, otherwise it
cannot tell the registation state before registation.

Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Diego Viola <diego.viola@gmail.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180430100344.533326547@infradead.org
2018-05-02 16:10:40 +02:00
Peter Zijlstra 5b9e886a4a clocksource: Initialize cs->wd_list
A number of places relies on list_empty(&cs->wd_list), however the
list_head does not get initialized. Do so upon registration, such that
thereafter it is possible to rely on list_empty() correctly reflecting
the list membership status.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Diego Viola <diego.viola@gmail.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Link: https://lkml.kernel.org/r/20180430100344.472662715@infradead.org
2018-05-02 16:10:40 +02:00
Peter Zijlstra 2aae7bcfa4 clocksource: Allow clocksource_mark_unstable() on unregistered clocksources
Because of how the code flips between tsc-early and tsc clocksources
it might need to mark one or both unstable. The current code in
mark_tsc_unstable() only worked because previously it registered the
tsc clocksource once and then never touched it.

Since it now unregisters the tsc-early clocksource, it needs to know
if a clocksource got unregistered and the current cs->mult test
doesn't work for that. Instead use list_empty(&cs->list) to test for
registration.

Furthermore, since clocksource_mark_unstable() needs to place the cs
on the wd_list, it links the cs->list and cs->wd_list serialization.
It must not see a clocsource registered (!empty cs->list) but already
past dequeue_watchdog(). So place {en,de}queue{,_watchdog}() under the
same lock.

Provided cs->list is initialized to empty, this then allows us to
unconditionally use clocksource_mark_unstable(), regardless of the
registration state.

Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Diego Viola <diego.viola@gmail.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180502135312.GS12217@hirez.programming.kicks-ass.net
2018-05-02 16:10:40 +02:00
Peter Zijlstra e9088adda1 x86/tsc: Always unregister clocksource_tsc_early
Don't leave the tsc-early clocksource registered if it errors out
early.

This was reported by Diego, who on his Core2 era machine got TSC
invalidated while it was running with tsc-early (due to C-states).
This results in keeping tsc-early with very bad effects.

Reported-and-Tested-by: Diego Viola <diego.viola@gmail.com>
Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180430100344.350507853@infradead.org
2018-05-02 16:10:40 +02:00
Agustin Vega-Frias 1bc2463cee irqchip/qcom: Fix check for spurious interrupts
When the interrupts for a combiner span multiple registers it must be
checked if any interrupts have been asserted on each register before
checking for spurious interrupts.

Checking each register seperately leads to false positive warnings.

[ tglx: Massaged changelog ]

Fixes: f20cc9b00c ("irqchip/qcom: Add IRQ combiner driver")
Signed-off-by: Agustin Vega-Frias <agustinv@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: timur@codeaurora.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1525184090-26143-1-git-send-email-agustinv@codeaurora.org
2018-05-02 15:56:10 +02:00
Michel Dänzer 892a0be43e swiotlb: fix inversed DMA_ATTR_NO_WARN test
The result was printing the warning only when we were explicitly asked
not to.

Cc: stable@vger.kernel.org
Fixes: 0176adb004 "swiotlb: refactor
 coherent buffer allocation"
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-02 14:48:55 +02:00
Filipe Manana a6aa10c70b Btrfs: send, fix missing truncate for inode with prealloc extent past eof
An incremental send operation can miss a truncate operation when an inode
has an increased size in the send snapshot and a prealloc extent beyond
its size.

Consider the following scenario where a necessary truncate operation is
missing in the incremental send stream:

1) In the parent snapshot an inode has a size of 1282957 bytes and it has
   no prealloc extents beyond its size;

2) In the the send snapshot it has a size of 5738496 bytes and has a new
   extent at offsets 1884160 (length of 106496 bytes) and a prealloc
   extent beyond eof at offset 6729728 (and a length of 339968 bytes);

3) When processing the prealloc extent, at offset 6729728, we end up at
   send.c:send_write_or_clone() and set the @len variable to a value of
   18446744073708560384 because @offset plus the original @len value is
   larger then the inode's size (6729728 + 339968 > 5738496). We then
   call send_extent_data(), with that @offset and @len, which in turn
   calls send_write(), and then the later calls fill_read_buf(). Because
   the offset passed to fill_read_buf() is greater then inode's i_size,
   this function returns 0 immediately, which makes send_write() and
   send_extent_data() do nothing and return immediately as well. When
   we get back to send.c:send_write_or_clone() we adjust the value
   of sctx->cur_inode_next_write_offset to @offset plus @len, which
   corresponds to 6729728 + 18446744073708560384 = 5738496, which is
   precisely the the size of the inode in the send snapshot;

4) Later when at send.c:finish_inode_if_needed() we determine that
   we don't need to issue a truncate operation because the value of
   sctx->cur_inode_next_write_offset corresponds to the inode's new
   size, 5738496 bytes. This is wrong because the last write operation
   that was issued started at offset 1884160 with a length of 106496
   bytes, so the correct value for sctx->cur_inode_next_write_offset
   should be 1990656 (1884160 + 106496), so that a truncate operation
   with a value of 5738496 bytes would have been sent to insert a
   trailing hole at the destination.

So fix the issue by making send.c:send_write_or_clone() not attempt
to send write or clone operations for extents that start beyond the
inode's size, since such attempts do nothing but waste time by
calling helper functions and allocating path structures, and send
currently has no fallocate command in order to create prealloc extents
at the destination (either beyond a file's eof or not).

The issue was found running the test btrfs/007 from fstests using a seed
value of 1524346151 for fsstress.

Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com>
Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-02 11:55:29 +02:00
ethanwu 998ac6d21c btrfs: Take trans lock before access running trans in check_delayed_ref
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.

When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.

After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.

Fix this by using trans_lock and increasing the use_count.

Fixes: e4c3b2dcd1 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-02 11:54:58 +02:00