Commit Graph

603209 Commits

Author SHA1 Message Date
Filipe Manana
44f714dae5 Btrfs: improve performance on fsync against new inode after rename/unlink
With commit 56f23fdbb6 ("Btrfs: fix file/data loss caused by fsync after
rename and new inode") we got simple fix for a functional issue when the
following sequence of actions is done:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  move/rename existing file A from directory D to directory E
  create a new file named A at directory D
  fsync the new file
  power fail

The solution was to simply detect such scenario and fallback to a full
transaction commit when we detect it. However this turned out to had a
significant impact on throughput (and a bit on latency too) for benchmarks
using the dbench tool, which simulates real workloads from smbd (Samba)
servers. For example on a test vm (with a debug kernel):

Unpatched:
Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms

Patched:
Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms

The patched results (this patch is applied) are similar to the results of
a kernel with the commit 56f23fdbb6 ("Btrfs: fix file/data loss caused
by fsync after rename and new inode") reverted.

This change avoids the fallback to a transaction commit and instead makes
sure all the names of the conflicting inode (the one that had a name in a
past transaction that matches the name of the new file in the same parent
directory) are logged so that at log replay time we don't lose neither the
new file nor the old file, and the old file gets the name it was renamed
to.

This also ends up avoiding a full transaction commit for a similar case
that involves an unlink instead of a rename of the old file:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  remove file A
  create a new file named A at directory D
  fsync the new file
  power fail

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:32:14 +01:00
Filipe Manana
67710892ec Btrfs: be more precise on errors when getting an inode from disk
When we attempt to read an inode from disk, we end up always returning an
-ESTALE error to the caller regardless of the actual failure reason, which
can be an out of memory problem (when allocating a path), some error found
when reading from the fs/subvolume btree (like a genuine IO error) or the
inode does not exists. So lets start returning the real error code to the
callers so that they don't treat all -ESTALE errors as meaning that the
inode does not exists (such as during orphan cleanup). This will also be
needed for a subsequent patch in the same series dealing with a special
fsync case.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:32:03 +01:00
Filipe Manana
951555856b Btrfs: send, don't bug on inconsistent snapshots
When doing an incremental send, if we find a new/modified/deleted extent,
reference or xattr without having previously processed the corresponding
inode item we end up exexuting a BUG_ON(). This is because whenever an
extent, xattr or reference is added, modified or deleted, we always expect
to have the corresponding inode item updated. However there are situations
where this will not happen due to transient -ENOMEM or -ENOSPC errors when
doing delayed inode updates.

For example, when punching holes we can succeed in deleting and modifying
(shrinking) extents but later fail to do the delayed inode update. So after
such failure we close our transaction handle and right after a snapshot of
the fs/subvol tree can be made and used later for a send operation. The
same thing can happen during truncate, link, unlink, and xattr related
operations.

So instead of executing a BUG_ON, make send return an -EIO error and print
an informative error message do dmesg/syslog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:31:41 +01:00
Filipe Manana
15b253eace Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
The caller of send_utimes() is supposed to be sure that the inode number
it passes to this function does actually exists in the send snapshot.
However due to logic/algorithm bugs (such as the one fixed by the patch
titled "Btrfs: send, fix invalid leaf accesses due to incorrect utimes
operations"), this might not be the case and when that happens it makes
send_utimes() access use an unrelated leaf item as the target inode item
or access beyond a leaf's boundaries (when the leaf is full and
path->slots[0] matches the number of items in the leaf).

So if the call to btrfs_search_slot() done by send_utimes() does not find
the inode item, just make sure send_utimes() returns -ENOENT and does not
silently accesses unrelated leaf items or does invalid leaf accesses, also
allowing us to easialy and deterministically catch such algorithmic/logic
bugs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:26:15 +01:00
Robbie Ko
764433a12e Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
During an incremental send, if we have delayed rename operations for inodes
that were children of directories which were removed in the send snapshot,
we can end up accessing incorrect items in a leaf or accessing beyond the
last item of the leaf due to issuing utimes operations for the removed
inodes. Consider the following example:

  Parent snapshot:
  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- c/                                                  (ino 262)
  |
  |--- b/                                                       (ino 258)
  |    |--- d/                                                  (ino 263)
  |
  |--- del/                                                     (ino 261)
        |--- x/                                                 (ino 259)
        |--- y/                                                 (ino 260)

  Send snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |
  |--- b/                                                       (ino 258)
  |
  |--- c/                                                       (ino 262)
  |    |--- y/                                                  (ino 260)
  |
  |--- d/                                                       (ino 263)
       |--- x/                                                  (ino 259)

1) When processing inodes 259 and 260, we end up delaying their rename
   operations because their parents, inodes 263 and 262 respectively, were
   not yet processed and therefore not yet renamed;

2) When processing inode 262, its rename operation is issued and right
   after the rename operation for inode 260 is issued. However right after
   issuing the rename operation for inode 260, at send.c:apply_dir_move(),
   we issue utimes operations for all current and past parents of inode
   260. This means we try to send a utimes operation for its old parent,
   inode 261 (deleted in the send snapshot), which does not cause any
   immediate and deterministic failure, because when the target inode is
   not found in the send snapshot, the send.c:send_utimes() function
   ignores it and uses the leaf region pointed to by path->slots[0],
   which can be any unrelated item (belonging to other inode) or it can
   be a region outside the leaf boundaries, if the leaf is full and
   path->slots[0] matches the number of items in the leaf. So we end
   up either successfully sending a utimes operation, which is fine
   and irrelevant because the old parent (inode 261) will end up being
   deleted later, or we end up doing an invalid memory access tha
   crashes the kernel.

So fix this by making apply_dir_move() issue utimes operations only for
parents that still exist in the send snapshot. In a separate patch we
will make send_utimes() return an error (-ENOENT) if the given inode
does not exists in the send snapshot.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and better organized]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:25:48 +01:00
Robbie Ko
443f9d266c Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
Under certain situations, when doing an incremental send, we can end up
not freeing orphan_dir_info structures as soon as they are no longer
needed. Instead we end up freeing them only after finishing the send
stream, which causes a warning to be emitted:

[282735.229200] ------------[ cut here ]------------
[282735.229968] WARNING: CPU: 9 PID: 10588 at fs/btrfs/send.c:6298 btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
[282735.231282] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[282735.237130] CPU: 9 PID: 10588 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
[282735.239309] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[282735.240160]  0000000000000000 ffff880224273ca8 ffffffff8126b42c 0000000000000000
[282735.240160]  0000000000000000 ffff880224273ce8 ffffffff81052b14 0000189a24273ac8
[282735.240160]  ffff8802210c9800 0000000000000000 0000000000000001 0000000000000000
[282735.240160] Call Trace:
[282735.240160]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[282735.240160]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[282735.240160]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[282735.240160]  [<ffffffffa03c99d5>] btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
[282735.240160]  [<ffffffffa0398358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[282735.240160]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[282735.240160]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[282735.240160]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[282735.240160]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[282735.240160]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[282735.240160]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[282735.240160]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[282735.240160]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
[282735.240160]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
[282735.256343] ---[ end trace a4539270c8056f93 ]---

Consider the following example:

  Parent snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- c/                                                  (ino 260)
  |
  |--- del/                                                     (ino 259)
        |--- tmp/                                               (ino 258)
        |--- x/                                                 (ino 261)
        |--- y/                                                 (ino 262)

  Send snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- x/                                                  (ino 261)
  |    |--- y/                                                  (ino 262)
  |
  |--- c/                                                       (ino 260)
       |--- tmp/                                                (ino 258)

1) When processing inode 258, we end up delaying its rename operation
   because it has an ancestor (in the send snapshot) that has a higher
   inode number (inode 260) which was also renamed in the send snapshot,
   therefore we delay the rename of inode 258 so that it happens after
   inode 260 is renamed;

2) When processing inode 259, we end up delaying its deletion (rmdir
   operation) because it has a child inode (258) that has its rename
   operation delayed. At this point we allocate an orphan_dir_info
   structure and tag inode 258 so that we later attempt to see if we
   can delete (rmdir) inode 259 once inode 258 is renamed;

3) When we process inode 260, after renaming it we finally do the rename
   operation for inode 258. Once we issue the rename operation for inode
   258 we notice that this inode was tagged so that we attempt to see
   if at this point we can delete (rmdir) inode 259. But at this point
   we can not still delete inode 259 because it has 2 children, inodes
   261 and 262, that were not yet processed and therefore not yet
   moved (renamed) away from inode 259. We end up not freeing the
   orphan_dir_info structure allocated in step 2;

4) We process inodes 261 and 262, and once we move/rename inode 262
   we issue the rmdir operation for inode 260;

5) We finish the send stream and notice that red black tree that
   contains orphan_dir_info structures is not empty, so we emit
   a warning and then free any orphan_dir_structures left.

So fix this by freeing an orphan_dir_info structure once we try to
apply a pending rename operation if we can not delete yet the tagged
directory.

A test case for fstests follows soon.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog to be more detailed and easier to understand]
2016-08-01 07:25:31 +01:00
Robbie Ko
99ea42ddb1 Btrfs: incremental send, fix premature rmdir operations
Under certain situations, an incremental send operation can contain
a rmdir operation that will make the receiving end fail when attempting
to execute it, because the target directory is not yet empty.

Consider the following example:

  Parent snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- c/                                                  (ino 260)
  |
  |--- del/                                                     (ino 259)
        |--- tmp/                                               (ino 258)
        |--- x/                                                 (ino 261)

  Send snapshot:

  .                                                             (ino 256)
  |--- a/                                                       (ino 257)
  |    |--- x/                                                  (ino 261)
  |
  |--- c/                                                       (ino 260)
       |--- tmp/                                                (ino 258)

1) When processing inode 258, we delay its rename operation because inode
   260 is its new parent in the send snapshot and it was not yet renamed
   (since 260 > 258, that is, beyond the current progress);

2) When processing inode 259, we realize we can not yet send an rmdir
   operation (against inode 259) because inode 258 was still not yet
   renamed/moved away from inode 259. Therefore we update data structures
   so that after inode 258 is renamed, we try again to see if we can
   finally send an rmdir operation for inode 259;

3) When we process inode 260, we send a rename operation for it followed
   by a rename operation for inode 258. Once we send the rename operation
   for inode 258 we then check if we can finally issue an rmdir for its
   previous parent, inode 259, by calling the can_rmdir() function with
   a value of sctx->cur_ino + 1 (260 + 1 = 261) for its "progress"
   argument. This makes can_rmdir() return true (value 1) because even
   though there's still a child inode of inode 259 that was not yet
   renamed/moved, which is inode 261, the given value of progress (261)
   is not lower then 261 (that is, not lower than the inode number of
   some child of inode 259). So we end up sending a rmdir operation for
   inode 259 before its child inode 261 is processed and renamed.

So fix this by passing the correct progress value to the call to
can_rmdir() from within apply_dir_move() (where we issue delayed rename
operations), which should match stcx->cur_ino (the number of the inode
currently being processed) and not sctx->cur_ino + 1.

A test case for fstests follows soon.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed, clear and well formatted]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:25:12 +01:00
Filipe Manana
4122ea64f8 Btrfs: incremental send, fix invalid paths for rename operations
Example scenario:

  Parent snapshot:

  .                                                       (ino 277)
  |---- tmp/                                              (ino 278)
  |---- pre/                                              (ino 280)
  |      |---- wait_dir/                                  (ino 281)
  |
  |---- desc/                                             (ino 282)
  |---- ance/                                             (ino 283)
  |       |---- below_ance/                               (ino 279)
  |
  |---- other_dir/                                        (ino 284)

  Send snapshot:

  .                                                       (ino 277)
  |---- tmp/                                              (ino 278)
         |---- other_dir/                                 (ino 284)
                   |---- below_ance/                      (ino 279)
                   |            |---- pre/                (ino 280)
                   |
                   |---- wait_dir/                        (ino 281)
                              |---- desc/                 (ino 282)
                                      |---- ance/         (ino 283)

While computing the send stream the following steps happen:

1) While processing inode 279 we end up delaying its rename operation
   because its new parent in the send snapshot, inode 284, was not
   yet processed and therefore not yet renamed;

2) Later when processing inode 280 we end up renaming it immediately to
   "ance/below_once/pre" and not delay its rename operation because its
   new parent (inode 279 in the send snapshot) has its rename operation
   delayed and inode 280 is not an encestor of inode 279 (its parent in
   the send snapshot) in the parent snapshot;

3) When processing inode 281 we end up delaying its rename operation
   because its new parent in the send snapshot, inode 284, was not yet
   processed and therefore not yet renamed;

4) When processing inode 282 we do not delay its rename operation because
   its parent in the send snapshot, inode 281, already has its own rename
   operation delayed and our current inode (282) is not an ancestor of
   inode 281 in the parent snapshot. Therefore inode 282 is renamed to
   "ance/below_ance/pre/wait_dir";

5) When processing inode 283 we realize that we can rename it because one
   of its ancestors in the send snapshot, inode 281, has its rename
   operation delayed and inode 283 is not an ancestor of inode 281 in the
   parent snapshot. So a rename operation to rename inode 283 to
   "ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is
   invalid due to a missing path building loop that was undetected by
   the incremental send implementation, as inode 283 ends up getting
   included twice in the path (once with its path in the parent snapshot).
   Therefore its rename operation must wait before the ancestor inode 284
   is renamed.

Fix this by not terminating the rename dependency checks when we find an
ancestor, in the send snapshot, that has its rename operation delayed. So
that we continue doing the same checks if the current inode is not an
ancestor, in the parent snapshot, of an ancestor in the send snapshot we
are processing in the loop.

The problem and reproducer were reported by Robbie Ko, as part of a patch
titled "Btrfs: incremental send, avoid ancestor rename to descendant".
However the fix was unnecessarily complicated and can be addressed with
much less code and effort.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:24:45 +01:00
Filipe Manana
7969e77a73 Btrfs: send, add missing error check for calls to path_loop()
The function path_loop() can return a negative integer, signaling an
error, 0 if there's no path loop and 1 if there's a path loop. We were
treating any non zero values as meaning that a path loop exists. Fix
this by explicitly checking for errors and gracefully return them to
user space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:23:20 +01:00
Robbie Ko
801bec365e Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.

For example, consider the following scenario:

  Parent snapshot:

  .                   (ino 256)
  |---- d/            (ino 257)
  |     |--- p1/      (ino 258)
  |
  |---- p1/           (ino 259)

  Send snapshot:

  .                    (ino 256)
  |--- d/              (ino 257)
       |--- p1/        (ino 259)
             |--- p1/  (ino 258)

The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:

[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136]  [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104]  [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---

There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:

  Parent snapshot:

  .
  |---- a/                                                   (ino 262)
  |     |---- c/                                             (ino 268)
  |
  |---- d/                                                   (ino 263)
        |---- ance/                                          (ino 267)
                |---- e/                                     (ino 264)
                |---- f/                                     (ino 265)
                |---- ance/                                  (ino 266)

  Send snapshot:

  .
  |---- a/                                                   (ino 262)
  |---- c/                                                   (ino 268)
  |     |---- ance/                                          (ino 267)
  |
  |---- d/                                                   (ino 263)
  |     |---- ance/                                          (ino 266)
  |
  |---- f/                                                   (ino 265)
        |---- e/                                             (ino 264)

When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:

  ERROR: rename d/ance/ance -> d/ance failed: No such file or directory

So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.

A test case for fstests will follow soon.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
 comments]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:23:10 +01:00
Filipe Manana
0596a9048b Btrfs: add missing check for writeback errors on fsync
When we start an fsync we start ordered extents for all delalloc ranges.
However before attempting to log the inode, we only wait for those ordered
extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
is set in the inode's flags). This means that if an ordered extent
completes with an IO error before we check if we can skip logging the
inode, we will not catch and report the IO error to user space. This is
because on an IO error, when the ordered extent completes we do not
update the inode, so if the inode was not previously updated by the
current transaction we end up not logging it through calls to fsync and
therefore not check its mapping flags for the presence of IO errors.

Fix this by checking for errors in the flags of the inode's mapping when
we notice we can skip logging the inode.

This caused sporadic failures in the test generic/331 (which explicitly
tests for IO errors during an fsync call).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-08-01 07:21:13 +01:00
Chris Mason
8b8b08cbfb Btrfs: fix delalloc accounting after copy_from_user faults
Commit 56244ef151 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.

Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.

This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did.  The
problem is accounting for the offset into the page/sector when we do a
partial copy.  This one just uses the dirty_sectors variable which
should already be updated properly.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v4.6+
2016-07-21 04:03:40 -07:00
Josef Bacik
bac357dcec Btrfs: avoid deadlocks during reservations in btrfs_truncate_block
The new enospc code makes it possible to deadlock if we don't use
FLUSH_LIMIT during reservations inside a transaction.  This enforces
the correct flush type to avoid both deadlocks and assertions

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2016-07-20 16:58:04 -07:00
Chris Mason
24502ebb60 Merge branch 'kdave-part1-enospc' into for-linus-4.8 2016-07-20 11:46:52 -07:00
Josef Bacik
8ca17f0f59 Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes
We used to allow you to set FLUSH_ALL and then just wouldn't do things like
commit transactions or wait on ordered extents if we noticed you were in a
transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
we've lost the ability to tell, and we could end up deadlocking.  So instead use
FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
we error out to preserve the previous behavior.  I've also added an ASSERT() to
catch anybody else who tries to do this.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
ac2fabac42 Btrfs: fill relocation block rsv after allocation
Since we set the reloc control before we've reserved our space for relocation we
could race with a root being dirtied and not actually have space to do our init
reloc root.  So once we've allocated it and set it up go ahead and make our
reservation before setting the relocate control, that way anybody who tries to
do the reloc root init has space to use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
40acc3eede Btrfs: always use trans->block_rsv for orphans
This is the case all the time anyway except for relocation which could be doing
a reloc root for a non ref counted root, in which case we'd end up with some
random block rsv rather than the one we have our reservation in.  If there isn't
enough space in the block rsv we are trying to steal from we'll BUG() because we
expect there to be space for the orphan to make its reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
ae2e472881 Btrfs: change how we calculate the global block rsv
Traditionally we've calculated the global block rsv by guessing how much of the
metadata used amount was the extent tree, and then taking the data size and
figuring out how large the csum tree would have to be to hold that much data.

This is imprecise and falls down on MIXED file systems as we can't trust the
data used amount.  This resulted in failures for xfstests generic/333 because it
creates lots of clones, which explodes out the extent tree.  Our global reserve
calculations were woefully inaccurate in this case which meant we got into a
situation where we did not have enough reserved to do our work.

We know we only use the global block rsv for the extent, csum, and root trees,
so just get the bytes used for these trees and use that as the basis of our
global reserve.  Since these are not reference counted trees the bytes_used
value will be accurate.  This fixed the transaction aborts seen with
generic/333.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
87241c2e68 Btrfs: use root when checking need_async_flush
Instead of doing fs_info->fs_root in need_async_flush, which may not be set
during recovery when mounting, just pass the root itself in, which makes more
sense as thats what btrfs_calc_reclaim_metadata_size takes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
d38b349c39 Btrfs: don't bother kicking async if there's nothing to reclaim
We do this check when we start the async reclaimer thread, might as well check
before we kick it off to save us some cycles.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
31bada7c4e Btrfs: fix release reserved extents trace points
We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
isn't quite right because we will go through and free that extent later when we
unpin, so it messes up apps that are accounting for the reservation space.  We
were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
only actually free the reservation instead of pinning the extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
dce3afa593 Btrfs: add fsid to some tracepoints
When tracing enospc problems on a box with multiple file systems mounted I need
to be able to differentiate between the two file systems.  Most of the important
trace points I'm looking at already have an fsid, but the reserved extent trace
points do not, so add that to make it possible to figure out which trace point
belongs to which file system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
f376df2b7d Btrfs: add tracepoints for flush events
We want to track when we're triggering flushing from our reservation code and
what flushing is being done when we start flushing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
f485c9ee32 Btrfs: fix delalloc reservation amount tracepoint
We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c51e7bb184 Btrfs: trace pinned extents
Pinned extents are an important metric to keep track of for enospc.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
957780eb27 Btrfs: introduce ticketed enospc infrastructure
Our enospc flushing sucks.  It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out.  So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation.  This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations.  Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread.  This async flusher thread does the same old
flushing we always did, just asynchronously.  As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything we need to avoid deadlocks.  These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c83f8effef Btrfs: add tracepoint for adding block groups
I'm writing a tool to visualize the enospc system inside btrfs, I need this
tracepoint in order to keep track of the block groups in the system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
d555b6c380 Btrfs: warn_on for unaccounted spaces
These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c48f49d63d Btrfs: change delayed reservation fallback behavior
We reserve space for the inode update when we first reserve space for writing to
a file.  However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents.  Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve.  The problem with this is we can easily
just return ENOSPC and fallback to updating the inode item directly.  In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through, however if we just fallback and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.

We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so we are more often than not only need one or two nodesize's worth of
reservations.  Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
48c3d480e4 Btrfs: always reserve metadata for delalloc extents
There are a few races in the metadata reservation stuff.  First we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it and after we've reserved the bytes.  So use the normal
btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know because we only check
before the reservation.  So instead make sure we are always reserving space for
the inode update, and then if we don't need it release those bytes afterward.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
25d609f86d Btrfs: fix callers of btrfs_block_rsv_migrate
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv.  This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size.  So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
e40edf2da4 Btrfs: add bytes_readonly to the spaceinfo at once
For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info.  This creates a tiny race where we
could over-reserve space because we haven't yet taken out the bytes_readonly
bit.  Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Linus Torvalds
a99cde438d Linux 4.7-rc6 2016-07-03 23:01:00 -07:00
Linus Torvalds
0b295dd5b8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
 "This makes sure userspace filesystems are not broken by the parallel
  lookups and readdir feature"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: serialize dirops by default
2016-07-03 12:02:00 -07:00
Linus Torvalds
236bfd8ed8 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This contains fixes for a dentry leak, a regression in 4.6 noticed by
  Docker users and missing write access checking in truncate"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: warn instead of error if d_type is not supported
  ovl: get_write_access() in truncate
  ovl: fix dentry leak for default_permissions
2016-07-03 11:57:09 -07:00
Vivek Goyal
e7c0b5991d ovl: warn instead of error if d_type is not supported
overlay needs underlying fs to support d_type. Recently I put in a
patch in to detect this condition and started failing mount if
underlying fs did not support d_type.

But this breaks existing configurations over kernel upgrade. Those who
are running docker (partially broken configuration) with xfs not
supporting d_type, are surprised that after kernel upgrade docker does
not run anymore.

https://github.com/docker/docker/issues/22937#issuecomment-229881315

So instead of erroring out, detect broken configuration and warn
about it. This should allow existing docker setups to continue
working after kernel upgrade.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 45aebeaf4f ("ovl: Ensure upper filesystem supports d_type")
Cc: <stable@vger.kernel.org> 4.6
2016-07-03 09:39:31 +02:00
Linus Torvalds
4f302921c1 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fix from Ralf Baechle:
 "Only a single fix for 4.7 pending at this point.  It fixes an issue
  that may lead to corruption of the cache mode bits in the page table"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
  MIPS: Fix possible corruption of cache mode by mprotect.
2016-07-02 19:10:21 -07:00
Linus Torvalds
70bd68d7f7 powerpc fixes for 4.7 #5
- tm: Always reclaim in start_thread() for exec() class syscalls from Cyril Bur
  - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael Neuling
  - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan
  - Initialise pci_io_base as early as possible from Darren Stevens
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXeEmsAAoJEFHr6jzI4aWAwMMQAKs/u9rwB3gpOkNJSHajN1Dd
 kdqDufzLxLDwbWnMfqM1+bcO2EOjPhKbtpbzhG6oeiET8undRRoLsjHS5rZeYK5h
 cviRPEJ/Yz8ZWaIgFGI8+02gXwU0MJhuTY8NPexXmsh4FRdKYwEuCIJShl30lg22
 P7UrJ2SCNM+H/uZyS07B7thiwBeAKSp6VkLTpuW/QDz2j1ra/F22dTh7c0Agdahd
 INAMAnh9nYeuMVYn4XjOOlQ07JnBTuf1/W5Wxlw4i/86rVq+Hy8zh5r1X52oysR5
 lZl63B9q3agKG9cc9lSN2ibTDVerlFMwB2QysX2a6Uy7+y2SB3hS7VS1RTXCh3hg
 /omApGGVW3Hh+E2CuKfFLQySU55NRpLAoTGravGr/KsH4wZP/n/fkrctldCrqm7P
 sTPT52+t+iJQk4fiskRY3yQ7DTTnt3rTC8MJRGqvLuCheolLll4NQaWOF75AJP+7
 WFWtC4QHOTPERMkhqLnZDG2vNuDg1H8chuZ2+PxtIs6G1vuOEun+MTZAYh4u6XWE
 bAIT9rV3xBdE17bzYOQz7lU1y7yNVtP7xkm0HIOAHlU4gUrjQp5u8F3TnPW3/M0m
 8GeaZdrPjhsaNg31YZODAeM8Ddf+N9d2a2VPIr/fzytURhMe0ss3Z/MdMoYRATab
 Lh1o+G3gDo9MVaphoJ3w
 =oEAY
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - tm: Always reclaim in start_thread() for exec() class syscalls from
   Cyril Bur

 - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael
   Neuling

 - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan

 - Initialise pci_io_base as early as possible from Darren Stevens

* tag 'powerpc-4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc: Initialise pci_io_base as early as possible
  powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
  powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()
  powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
2016-07-02 17:47:54 -07:00
Linus Torvalds
99b0f54e6a Merge tag 'drm-fixes-for-v4.7-rc6' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes frlm Dave Airlie:
 "Just some AMD and Intel fixes, the AMD ones are further production
  Polaris fixes, and the Intel ones fix some early timeouts, some PCI ID
  changes and a couple of other fixes.

  Still a bit Internet challenged here, hopefully end of next week will
  solve it"

* tag 'drm-fixes-for-v4.7-rc6' of git://people.freedesktop.org/~airlied/linux:
  drm/i915: Fix missing unlock on error in i915_ppgtt_info()
  drm/amd/powerplay: workaround for UVD clock issue
  drm/amdgpu: add ACLK_CNTL setting for polaris10
  drm/amd/powerplay: fix issue uvd dpm can't enabled on Polaris11.
  drm/amd/powerplay: Workaround for Memory EDC Error on Polaris10.
  drm/i915: Removing PCI IDs that are no longer listed as Kabylake.
  drm/i915: Add more Kabylake PCI IDs.
  drm/i915: Avoid early timeout during AUX transfers
  drm/i915/hsw: Avoid early timeout during LCPLL disable/restore
  drm/i915/lpt: Avoid early timeout during FDI PHY reset
  drm/i915/bxt: Avoid early timeout during PLL enable
  drm/i915: Refresh cached DP port register value on resume
  drm/amd/powerplay: Update CKS on/ CKS off voltage offset calculation
  drm/amd/powerplay: disable FFC.
  drm/amd/powerplay: add some definition for FFC feature on polaris.
2016-07-02 09:41:28 -07:00
Linus Torvalds
467ce7693f spi: Fixes for v4.7
A few small driver-specific fixes for SPI, all in the normal important
 if you hit them category especially the rockchip driver fix which
 addresses a race which has been exposed more frequently with some recent
 performance improvements.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXd4PmAAoJECTWi3JdVIfQ0A4H/1D1m/bsdAY0FS12YQteV1GQ
 JYUvj/CWNiuV1MSqIU0WWrlMFO9EwtsojytLIfmGNPXUvPMSpW/Q3IJqhRrQWhmN
 Zl1FUUUIz+sGBsqwq6nZbDJxJhpT6/Tb7YDFR6Oi4l7VeB4Sisv5ax6Ay+uwa/mp
 cvrb/ULILLCHAv0v+6rSjIkZlvD1Yc+08SbbUrTPtVdcY0TDbJEkZ+U2IviCsxmP
 PuqdPUgFIEy7j+hnbEib24f5BNZ/m1a0DY012+es7fSkD5DVa2h71kxS54RC6QYn
 BkkMdK7moflBpbipKlI/eBPz73eePO9SxBwUZFLSUzG4DJEJE2mKKRVyU2LHpGA=
 =uU1M
 -----END PGP SIGNATURE-----

Merge tag 'spi-fix-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A few small driver-specific fixes for SPI, all in the normal important
  if you hit them category especially the rockchip driver fix which
  addresses a race which has been exposed more frequently with some
  recent performance improvements"

* tag 'spi-fix-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: sunxi: fix transfer timeout
  spi: sun4i: fix FIFO limit
  spi: rockchip: Signal unfinished DMA transfers
  spi: spi-ti-qspi: Suspend the queue before removing the device
2016-07-02 09:40:11 -07:00
Linus Torvalds
a2b0db5b55 regulator: Fixes for v4.7
Two small fixes for the regulator subsystem - one fixing a crash with
 one of the devices supported by the max77620 driver, another fixing
 startup for the anatop regulator when it starts up with the regulator in
 bypass mode.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXd4GeAAoJECTWi3JdVIfQ+g8H/jpOOKEri6SVaatEbk2J33oA
 YD6EpSV4BRVGQRRLSY11tz3Md45xPLhe47kRpT1antgplKVEAJYARGYGVLmCovGz
 nFXqbQzeTvxs6o6UdrkJlFLDmn4yZgrL4MqhjfxLUzX+Yz/3neQZq6KhESCWKphT
 WPBrNa90s0j+nBCGJV0LxcuoZiKt6th/GUjr0gngepckrg3gIxJJyRVvtE9iVyip
 JYdVTfHt/aYJihIdKAPnXa4M9Ky5KY8ZNTySYcyUaXT3VLDK/UNFFO25q4Nw6F5a
 6XgcjwcFYDpZitf7pQYYfoobDbgTJ1XllPujEP82rIafruAvAKtfORHnvVJfRlU=
 =ZA94
 -----END PGP SIGNATURE-----

Merge tag 'regulator-fix-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
 "Two small fixes for the regulator subsystem - one fixing a crash with
  one of the devices supported by the max77620 driver, another fixing
  startup for the anatop regulator when it starts up with the regulator
  in bypass mode"

* tag 'regulator-fix-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: max77620: check for valid regulator info
  regulator: anatop: allow regulator to be in bypass mode
2016-07-02 09:39:03 -07:00
Linus Torvalds
4438512005 A small fix for the newly added oxnas clk driver and a handful of
rockchip clk driver fixes for newly added rk3399 support.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXdwP4AAoJEK0CiJfG5JUlG8oP/R+kfW2OBvwrJ4ubGnl68obU
 PoLqATqi9S3Zq7nEd4B04+SQT4OBgj63YnCAEwJY9D7KSCoJNYL9nML9FzB/MCiC
 R2Yune3PIsaEuZFZObnURGO5eRxyZqSJEGgzck1LXwrX/BCyrVHGYsJ0EWm2exZo
 zp8YPpM73oILwwCJGivVATNWXFT9HtQ1nfxAmH/e288t6nNuCnXPJf7PLm1t6ERF
 fjQiPRILodQqB3OtKEYvEv2z682RC0Zivwv6O/dd0FEG33aXB2VgAlCz8Ttp/brf
 fFV5L+i7XjDGDyBcZR2dYCOGwx42XYFOKvUpK+ePnbD0x23V3MSjwFU4yeItZlnD
 3PKO0lIJgbRjMtL4a03iWLzcaxDEbMb+/RVPMUtUWoGyId9l2S+cUblB7NuzTfLL
 RnTu6ttUSh3IY81eiAkrDRjafdAkswOf3lJweg+Rx2OZhzaKX0UcpDBaeUdzdqx6
 lxZ8chv8zYnfjgYUEu2jYvIIM/0VoHZgDmvarNRMreATPaOYqDMTfwsGZeTYcvNU
 qZhpNmue3c1cKGAYwzO3elB3KwhuB+83DVXGJu+6/RabbSk3OTMhaGwIH1NSQeCj
 nStb6C9h+Mog9ZTFukUlx7js6d0O8GmjlT4crlWxJ4QGdDvS2j7/hT53e8uK7SLl
 N5WzJrO9szvkZ8ZC9/LI
 =fAAu
 -----END PGP SIGNATURE-----

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
 "A small fix for the newly added oxnas clk driver and a handful of
  rockchip clk driver fixes for newly added rk3399 support"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: Fix return value check in oxnas_stdclk_probe()
  clk: rockchip: release io resource when failing to init clk on rk3399
  clk: rockchip: fix cpuclk registration error handling
  clk: rockchip: Revert "clk: rockchip: reset init state before mmc card initialization"
  clk: rockchip: fix incorrect parent for rk3399's {c,g}pll_aclk_perihp_src
  clk: rockchip: mark rk3399 GIC clocks as critical
  clk: rockchip: initialize flags of clk_init_data in mmc-phase clock
2016-07-02 09:36:49 -07:00
Dave Airlie
88c087109b Merge tag 'drm-intel-fixes-2016-06-30' of git://anongit.freedesktop.org/drm-intel into drm-fixes
here's a batch of i915 fixes for 4.7.

* tag 'drm-intel-fixes-2016-06-30' of git://anongit.freedesktop.org/drm-intel:
  drm/i915: Fix missing unlock on error in i915_ppgtt_info()
  drm/i915: Removing PCI IDs that are no longer listed as Kabylake.
  drm/i915: Add more Kabylake PCI IDs.
  drm/i915: Avoid early timeout during AUX transfers
  drm/i915/hsw: Avoid early timeout during LCPLL disable/restore
  drm/i915/lpt: Avoid early timeout during FDI PHY reset
  drm/i915/bxt: Avoid early timeout during PLL enable
  drm/i915: Refresh cached DP port register value on resume
2016-07-02 15:50:41 +10:00
Dave Airlie
40793e85d2 Merge branch 'drm-fixes-4.7' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Just a few more late fixes for Polaris cards.

* 'drm-fixes-4.7' of git://people.freedesktop.org/~agd5f/linux:
  drm/amd/powerplay: workaround for UVD clock issue
  drm/amdgpu: add ACLK_CNTL setting for polaris10
  drm/amd/powerplay: fix issue uvd dpm can't enabled on Polaris11.
  drm/amd/powerplay: Workaround for Memory EDC Error on Polaris10.
  drm/amd/powerplay: Update CKS on/ CKS off voltage offset calculation
  drm/amd/powerplay: disable FFC.
  drm/amd/powerplay: add some definition for FFC feature on polaris.
2016-07-02 15:48:33 +10:00
Ralf Baechle
6d037de90a MIPS: Fix possible corruption of cache mode by mprotect.
The following testcase may result in a page table entries with a invalid
CCA field being generated:

static void *bindstack;

static int sysrqfd;

static void protect_low(int protect)
{
	mprotect(bindstack, BINDSTACK_SIZE, protect);
}

static void sigbus_handler(int signal, siginfo_t * info, void *context)
{
	void *addr = info->si_addr;

	write(sysrqfd, "x", 1);

	printf("sigbus, fault address %p (should not happen, but might)\n",
	       addr);
	abort();
}

static void run_bind_test(void)
{
	unsigned int *p = bindstack;

	p[0] = 0xf001f001;

	write(sysrqfd, "x", 1);

	/* Set trap on access to p[0] */
	protect_low(PROT_NONE);

	write(sysrqfd, "x", 1);

	/* Clear trap on access to p[0] */
	protect_low(PROT_READ | PROT_WRITE | PROT_EXEC);

	write(sysrqfd, "x", 1);

	/* Check the contents of p[0] */
	if (p[0] != 0xf001f001) {
		write(sysrqfd, "x", 1);

		/* Reached, but shouldn't be */
		printf("badness, shouldn't happen but does\n");
		abort();
	}
}

int main(void)
{
	struct sigaction sa;

	sysrqfd = open("/proc/sysrq-trigger", O_WRONLY);

	if (sigprocmask(SIG_BLOCK, NULL, &sa.sa_mask)) {
		perror("sigprocmask");
		return 0;
	}

	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO | SA_NODEFER | SA_RESTART;
	if (sigaction(SIGBUS, &sa, NULL)) {
		perror("sigaction");
		return 0;
	}

	bindstack = mmap(NULL,
			 BINDSTACK_SIZE,
			 PROT_READ | PROT_WRITE | PROT_EXEC,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bindstack == MAP_FAILED) {
		perror("mmap bindstack");
		return 0;
	}

	printf("bindstack: %p\n", bindstack);

	run_bind_test();

	printf("done\n");

	return 0;
}

There are multiple ingredients for this:

 1) PAGE_NONE is defined to _CACHE_CACHABLE_NONCOHERENT, which is CCA 3
    on all platforms except SB1 where it's CCA 5.
 2) _page_cachable_default must have bits set which are not set
    _CACHE_CACHABLE_NONCOHERENT.
 3) Either the defective version of pte_modify for XPA or the standard
    version must be in used.  However pte_modify for the 36 bit address
    space support is no affected.

In that case additional bits in the final CCA mode may generate an invalid
value for the CCA field.  On the R10000 system where this was tracked
down for example a CCA 7 has been observed, which is Uncached Accelerated.

Fixed by:

 1) Using the proper CCA mode for PAGE_NONE just like for all the other
    PAGE_* pte/pmd bits.
 2) Fix the two affected variants of pte_modify.

Further code inspection also shows the same issue to exist in pmd_modify
which would affect huge page systems.

Issue in pte_modify tracked down by Alastair Bridgewater, PAGE_NONE
and pmd_modify issue found by me.

The history of this goes back beyond Linus' git history.  Chris Dearman's
commit 351336929c ("[MIPS] Allow setting of
the cache attribute at run time.") missed the opportunity to fix this
but it was originally introduced in lmo commit
d523832cf12007b3242e50bb77d0c9e63e0b6518 ("Missing from last commit.")
and 32cc38229ac7538f2346918a09e75413e8861f87 ("New configuration option
CONFIG_MIPS_UNCACHED.")

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reported-by: Alastair Bridgewater <alastair.bridgewater@gmail.com>
2016-07-02 01:51:39 +02:00
Linus Torvalds
dbdc3bb74f ACPI fix for v4.7-rc6
Fix an expression in the ACPI PCI IRQ management code added by a
 recent commit that overlooked missing parens in it, so the result
 of the computation is incorrect in some cases (Sinan Kaya).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXdtScAAoJEILEb/54YlRx5r4P/AwWjfh3wPsDCXCajSkiqR7b
 raPnXYXvVhEC4ZfqLpuID1sk/DzQL/MZmkFpB8PUYkf1onOKXcd9MDsyiVdOuJi/
 gqP64yzGI+n79qq4u+vhUlujsDH11X8WtKxNQoZWWdHoLR5o/qhYA4c3kZ0yjzim
 uD/3GWtXsmbPMyBMJjOIFJLFpj3DgizxER3PUL0WdX1XB8YXX/e16WT4+8IkHVru
 mM8Mk389+iQoKfDgWAL9ImOcfY9xXh0IPkKF/kHT2D0CG4qHfjGWUfu124hu83Fp
 YGIhcmX0De5cwb3PchQ4Uga2U4KbzKgAhW574RFD+pCmY5EX/fjYDaPPCU4Gctzf
 RNTG3JIS+mpR2maC+t73lJUanQSxS4GahXtEGfJ7ci3q85sIXuX5GSqd9CkXScCZ
 /0Ww+xjiXyeo1EE/PFC0rXZT7h5bzCdpqWbUUhcSDnFKj8TLz/LjiNd6vCFZAVUF
 G/7p9rY2n+Nid8j7DVITKcYS6HuaucKOTGKbPp7P3HBXc5N3d5vqrSrVcwQ/Yi+g
 hwaoDle506BUvTgIwrjCEPBRFRmFz8uRHOMMbpugzYZm2Kn5UUF3Za358NqI6pYR
 LiosgEiCXYFQScDeFOV5QD9hRGRoWaL6C54po1d459MGjqgwDt4lolbOhy7BIWXj
 pJMmdHbczB0VT6KRyYxs
 =r8BR
 -----END PGP SIGNATURE-----

Merge tag 'acpi-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
 "Fix an expression in the ACPI PCI IRQ management code added by a
  recent commit that overlooked missing parens in it, so the result of
  the computation is incorrect in some cases (Sinan Kaya)"

* tag 'acpi-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI,PCI,IRQ: correct operator precedence
2016-07-01 15:31:48 -07:00
Linus Torvalds
81dbd6f59d Power management fixes for v4.7-rc6
- Fix a recent intel_pstate regression that caused the number of
    wakeups to increase significantly on an idle system in some cases
    due to excessive synchronize_sched() invocations (Rafael Wysocki).
 
  - Fix unnecessary invocations of WARN_ON() in the cpufreq core
    after cpufreq has been suspended introduced during the 4.6 cycla
    (Rafael Wysocki).
 
  - Fix an error code path in the cpufreq-dt-platdev driver that
    forgets to drop a reference to a DT node (Masahiro Yamada).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXdtQFAAoJEILEb/54YlRxZEkQAKHFWVRGwIGRT4mF6Tx8P6PN
 mg+9fQbTKAKNFnNZZp6xGhch97s1sIWf5GPRTS72zZkP8oHLERNphVhh7ucPslk7
 XicBaawxmwIAlalcB8S8I/i8hGZqwYpTgpPx0DsDxA8E9XJLnELmiJpiB520FoQp
 Ym+PzbE5LHHPdxZrNhzMejfOvi5X1svFvRQRxTnHF638kfOJswmABX0WRGyYjtHP
 FDasdVv+S7JbJhCP/nOTVW50/Scao67nL8GqtiovYjWPBq85+WYAo63qHK46XwlG
 NgOMIqVfdb89e7aoPmlHW/FS9QAqQYOsK+pZv+fimRCAMottu1XEXLSzBzQYQSqU
 RYW3cJkyf06F4Vk+7Jpgdujia4Fv2OBU1qQKmvvlXyPLZVhssTfTECx/kkEiQtrK
 Id8rzlyYaPIDRdpG1cOWH3my8Ypv75EWvuDnBlBojweiq+1twcncLbIBRmY0HMCB
 QAcnxZTqNHq5lacgyixwOPKIkSRoXT0dFt6hnGX0CE9NdiplhSltQ/uBmDNz6Asi
 nRq2A+3Bwv6uc5PlHr4IfA4xZ7f8oAxj42dybAOgAM0dU89LjmOtaLCtYDxH6mSk
 xQBTe3R6tLbFK++lI5XBgjRRxbrwixXIZFwK6bdJnESVf8+UH4I1SbkwoXf0cxtx
 jTQ1C2XLx5oxdvTE2J60
 =+apA
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "Three cpufreq fixes, one in the core (stable-candidate) and two in
  drivers (intel_pstate and cpufreq-dt).

  Specifics:

   - Fix a recent intel_pstate regression that caused the number of
     wakeups to increase significantly on an idle system in some cases
     due to excessive synchronize_sched() invocations (Rafael Wysocki).

   - Fix unnecessary invocations of WARN_ON() in the cpufreq core after
     cpufreq has been suspended introduced during the 4.6 cycla (Rafael
     Wysocki).

   - Fix an error code path in the cpufreq-dt-platdev driver that
     forgets to drop a reference to a DT node (Masahiro Yamada)"

* tag 'pm-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Avoid false-positive WARN_ON()s in cpufreq_update_policy()
  cpufreq: dt: call of_node_put() before error out
  intel_pstate: Do not clear utilization update hooks on policy changes
2016-07-01 15:28:22 -07:00
Linus Torvalds
48c4565ed6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Tmpfs readdir throughput regression fix (this cycle) + some -stable
  fodder all over the place.

  One missing bit is Miklos' tonight locks.c fix - NFS folks had already
  grabbed that one by the time I woke up ;-)"

[ The locks.c fix came through the nfsd tree just moments ago ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namespace: update event counter when umounting a deleted dentry
  9p: use file_dentry()
  ceph: fix d_obtain_alias() misuses
  lockless next_positive()
  libfs.c: new helper - next_positive()
  dcache_{readdir,dir_lseek}(): don't bother with nested ->d_lock
2016-07-01 15:20:11 -07:00
Linus Torvalds
2728c57fda One fix for lockd soft lookups in an error path, and one fix for file
leases on overlayfs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXdr1UAAoJECebzXlCjuG+QlsQAJiZTmio6k9tupN5+iKsZNL3
 m919ooj8GYsXxlC0OTFfi09dUi1yeF8MEE3j9egk/+qxEtsdmKdEOIy0RVdcLSfd
 HeGjXgLh79hVcGxgyBP+pdax2XhZ3RVisg8F5gTw2GPo+FPFZfrEuO5h7ctn+t45
 MCQ+4yqYqzEhYnoPyo5XKh5Aj6wBWiaTzg3/jSe6uSuSfuBfyaMaBPq7l7ayGra/
 5El+tu61o/SrJ41N2EayWSj/bOFJE92LIuGOh8NdfANuuP70JhxlwgVSldah3CCQ
 6PymXAVcjw0+gJ00mokzKfTJW5FPfasxckMHaOvcFsSONy4rlmrwwqUr9C2AFzTE
 wQGIzibCDYOSI8uF+//Oe+dh8JWp2TF8rfJcmKLyJMIcCq/Xl6cNx1qrq/oSWvuk
 WKOv1otJrPeT31h/s5iLjr/E/Po1eX+d2ySJdvUHYu/5aZwFgWPnVXwJk0s9bLow
 auZU85tPnuz+tbS2pWEK+el7LMgDBdzraVRogMdH1c+m3+G9pr53EzmpYovkZ2X8
 duVJ2Leslyya347TnJAgEY47Fbeu26JaoeIChGVhKcEyCENlcqAJWaGVECrxvs3y
 p/Y2MYMkO8YrCz5wQXPiLFiG4rAc+jSIn1Q+vRGl2Pkel0y7AgJNNMFtANjMCSIO
 pg6BqUjOyKt8cXVy4UW7
 =K0mr
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.7-3' of git://linux-nfs.org/~bfields/linux

Pull lockd/locks fixes from Bruce Fields:
 "One fix for lockd soft lookups in an error path, and one fix for file
  leases on overlayfs"

* tag 'nfsd-4.7-3' of git://linux-nfs.org/~bfields/linux:
  locks: use file_inode()
  lockd: unregister notifier blocks if the service fails to come up completely
2016-07-01 15:18:49 -07:00
Linus Torvalds
0d064a7b9c - Final patches fixing Reset API change
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXdkJCAAoJEFGvii+H/Hdh4fIP/0U47cICszk6AkV2P3q3yRc6
 RyIfjBddVxAnNnUQoJAkyDnZmZnCH33wMa3NG8BlXCrtPUSMoUkG78RIN6mWjgp6
 pooaXBylqIiU4ZoyQVXvxTQZhM7u/ox+2InrXOeZXirkJvRY4GzwsNWtdhzclHq2
 iM86K1cfaIPdJ4d3y7WF6/5DOsdMiYQUvmtxLEYTKgJgGvkgM/9I7Ws8vk3R0lXk
 wO8EItxIZecAIEwfwI8HgYCoyPctevMSnXMTWAJTF+5m3Wprhhv0lEjCTEBHbmJb
 kkrAr86tNj4nNVG9pQCDYL/XX0Zpub0vTA4+mPL5/WTrVpI/RRYlJXKIlR25vvlN
 wlYRrY4mE4gKFJh7t6DhQiHdrrQcoWPymKZ7CSF2GZvc65xwBa0u3ZkeLhurgj0N
 C8Zj7QRIqHfP4O8JDMRvuToWJWtc8sfSvh2Y9szkevwNjZjO9INSM/hStfJZ1S0E
 EnZIcHSKehjdiLHiMgyWGRB3Fq24u41u8aVAGvCfCsSe7CZKYKzyAHEQSC7solSJ
 GmbJ6F2gB65v62rNReZeP9NCc25fex7ZkdYMr6XYqrSh7GoxtF+9VPHEvYXs5BVK
 +rVVCOvuvqSy4+oQ+N/0grLHIThqSVbAusXBSLFPykHO7R4Z30f3wIiSPTq8m2VW
 c5qwGR4ceOBRrX8fQw0c
 =PpU9
 -----END PGP SIGNATURE-----

Merge tag 'mfd-fixes-4.7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull more MFD fixes from Lee Jones:
 "Apologies for missing these from the first pull request.

  Final patches fixing Reset API change"

* tag 'mfd-fixes-4.7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  usb: dwc3: st: Use explicit reset_control_get_exclusive() API
  phy: phy-stih407-usb: Use explicit reset_control_get_exclusive() API
  phy: miphy28lp: Inform the reset framework that our reset line may be shared
2016-07-01 15:17:16 -07:00