Indicate change in RSS from start to end of pass in time-passes output
Previously, this was omitted because it could be misleading, but the
functionality seems too useful not to include.
r? ``@oli-obk``
Indicate both start and end of pass RSS in time-passes output
Previously, only the end of pass RSS was indicated. This could easily
lead one to believe that the change in RSS from one pass to the next was
attributable to the second pass, when in fact it occurred between the
end of the first pass and the start of the second.
Also, improve alignment of columns.
Sample of output:
```
time: 0.739; rss: 607MB -> 637MB item_types_checking
time: 8.429; rss: 637MB -> 775MB item_bodies_checking
time: 11.063; rss: 470MB -> 775MB type_check_crate
time: 0.232; rss: 775MB -> 777MB match_checking
time: 0.139; rss: 777MB -> 779MB liveness_and_intrinsic_checking
time: 0.372; rss: 775MB -> 779MB misc_checking_2
time: 8.188; rss: 779MB -> 1019MB MIR_borrow_checking
time: 0.062; rss: 1019MB -> 1021MB MIR_effect_checking
```
Previously, only the end of pass RSS was indicated. This could easily
lead one to believe that the change in RSS from one pass to the next was
attributable to the second pass, when in fact it occurred between the
end of the first pass and the start of the second.
Also, improve alignment of columns.
Enforce that query results implement Debug
Currently, we require that query keys implement `Debug`, but we do not do the same for query values. This can make incremental compilation bugs difficult to debug - there isn't a good place to print out the result loaded from disk.
This PR adds `Debug` bounds to several query-related functions, allowing us to debug-print the query value when an 'unstable fingerprint' error occurs. This required adding `#[derive(Debug)]` to a fairly large number of types - hopefully, this doesn't have much of an impact on compiler bootstrapping times.
Before:
```
thread 'rustc' panicked at 'attempt to read from stolen value', /home/joshua/rustc/compiler/rustc_data_structures/src/steal.rs:43:15
```
After:
```
thread 'rustc' panicked at 'attempt to steal from stolen value', compiler/rustc_mir/src/transform/mod.rs:423:25
```
Reduce a large memory spike that happens during serialization by writing
the incr comp structures to file by way of a fixed-size buffer, rather
than an unbounded vector.
Effort was made to keep the instruction count close to that of the
previous implementation. However, buffered writing to a file inherently
has more overhead than writing to a vector, because each write may
result in a handleable error. To reduce this overhead, arrangements are
made so that each LEB128-encoded integer can be written to the buffer
with only one capacity and error check. Higher-level optimizations in
which entire composite structures can be written with one capacity and
error check are possible, but would require much more work.
The performance is mostly on par with the previous implementation, with
small to moderate instruction count regressions. The memory reduction is
significant, however, so it seems like a worth-while trade-off.
Stabilize or_insert_with_key
Stabilizes the `or_insert_with_key` feature from https://github.com/rust-lang/rust/issues/71024. This allows inserting key-derived values when a `HashMap`/`BTreeMap` entry is vacant.
The difference between this and `.or_insert_with(|| ... )` is that this provides a reference to the key to the closure after it is moved with `.entry(key_being_moved)`, avoiding the need to copy or clone the key.
They were originally called "opt-in, built-in traits" (OIBITs), but
people realized that the name was too confusing and a mouthful, and so
they were renamed to just "auto traits". The feature flag's name wasn't
updated, though, so that's what this PR does.
There are some other spots in the compiler that still refer to OIBITs,
but I don't think changing those now is worth it since they are internal
and not particularly relevant to this PR.
Also see <https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/opt-in.2C.20built-in.20traits.20(auto.20traits).20feature.20name>.
Reworks Sccc computation to iteration instead of recursion
Linear graphs, producing as many scc's as nodes, would recurse once for every node when entered from the start of the list. This adds a test that exhausted the stack at least on my machine with error:
```
thread 'graph::scc::tests::test_deep_linear' has overflowed its stack
fatal runtime error: stack overflow
```
This may or may not be connected to #78567. I was only reminded that I started this rework some time ago. It might be plausible as borrow checking a long function with many borrow regions around each other—((((((…))))))— may produce the linear list setup to trigger this stack overflow ? I don't know enough about borrow check to say for sure.
This is best read in two separate commits. The first addresses only `find_state` internally. This is classical union phase from union-find. There's also a common solution of using the parent pointers in the (virtual) linked list to track the backreferences while traversing upwards and then following them backwards in a second path compression phase.
The second is more involved as it rewrites the mutually recursive `walk_node` and `walk_unvisited_node`. Firstly, the caller is required to handle the unvisited case of `walk_node` so a new `start_walk_from` method is added to handle that by walking the unvisited node if necessary. Then `walk_unvisited_node`, where we would previously recurse into in the missing case, is rewritten to construct a manual stack of its frames. The state fields consist of the previous stack slots.
[self-profiling] Include the estimated size of each cgu in the profile
This is helpful when looking for CGUs where the size estimate isn't a
good indicator of compilation time.
I verified that moving the profiling timer call doesn't affect the
results.
Results:
<img width="297" alt="Screen Shot 2020-11-03 at 7 25 04 AM" src="https://user-images.githubusercontent.com/831192/97985503-5901d100-1da6-11eb-9f10-f3e399702952.png">
`measureme` doesn't have support for custom arg names yet so `arg0` is the CGU name and `arg1` is the estimated size.
Move likely/unlikely argument outside of invisible unsafe block
The previous `likely!`/`unlikely!` macros were unsound because it permits the caller's expr to contain arbitrary unsafe code.
```rust
pub fn huh() -> bool {
likely!(std::ptr::read(&() as *const () as *const bool))
}
```
**Before:** compiles cleanly.
**After:**
```console
error[E0133]: call to unsafe function is unsafe and requires unsafe function or block
|
70 | likely!(std::ptr::read(&() as *const () as *const bool))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ call to unsafe function
|
= note: consult the function's documentation for information on how to avoid undefined behavior
```
The previous `likely!`/`unlikely!` macros were unsound because it
permits the caller's expr to contain arbitrary unsafe code.
pub fn huh() -> bool {
likely!(std::ptr::read(&() as *const () as *const bool))
}
Before: compiles cleanly.
After:
error[E0133]: call to unsafe function is unsafe and requires unsafe function or block
|
70 | likely!(std::ptr::read(&() as *const () as *const bool))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ call to unsafe function
|
= note: consult the function's documentation for information on how to avoid undefined behavior
This allows constructing the sccc for large that visit many nodes before
finding a single cycle of sccc, for example lists. When used to find
dependencies in borrow checking the list case is what occurs in very
long functions.
The basic conversion is a straightforward conversion of the linear
recursion to a loop forwards and backwards propagation of the result.
But this uses an optimization to avoid the need for extra space that
would otherwise be necessary to store the stack of unfinished states as
the function is not tail recursive.
Observe that only non-root-nodes in cycles have a recursive call and
that every such call overwrites their own node state. Thus we reuse the
node state itself as temporary storage for the stack of unfinished
states by inverting the links to a chain back to the previous state
update. When we hit the root or end of the full explored chain we
propagate the node state update backwards by following the chain until
a node with a link to itself.
This is helpful when looking for CGUs where the size estimate isn't a
good indicator of compilation time.
I verified that moving the profiling timer call doesn't affect the
results.
The previous recursive approach might overflow the stack when walking a
particularly deep, list-like, graph. In particular, dominator
calculation for borrow checking does such a traversal and very long
functions might lead to a region dependency graph with in this
problematic structure.
Avoid BorrowMutError with RUSTC_LOG=debug
```console
$ touch empty.rs
$ env RUSTC_LOG=debug rustc +stage1 --crate-type=lib empty.rs
```
Fails with a `BorrowMutError` because source map files are already
borrowed while `features_query` attempts to format a log message
containing a span.
Release the borrow before the query to avoid the issue.
perf: buffer SipHasher128
This is an attempt to improve Siphasher128 performance by buffering input. Although it reduces instruction count, I'm not confident the effect on wall times, or lack-thereof, is worth the change.
---
Additional notes not reflected in source comments:
* Implementation choices were guided by a combination of results from rustc-perf and micro-benchmarks, mostly the former.
* ~~I tried a couple of different struct layouts that might be more cache friendly with no obvious effect.~~ Update: a particular struct layout was chosen, but it's not critical to performance. See comments in source and discussion below.
* I suspect that buffering would be important to a SIMD-accelerated algorithm, but from what I've read and my own tests, SipHash does not seem very amenable to SIMD acceleration, at least by SSE.
Simplify query proc-macros
The query code generation is split between proc-macros and regular macros in `rustc_middle::ty::query`.
This PR removes unused capabilities of the proc-macros, and tend to use regular macros for the logic.
Try to make ObligationForest more efficient
This PR tries to decrease the number of allocations in ObligationForest, as well as moves some cold path code to an uninlined function.