The performance hit from these checks is significant, but unoptimized
builds are already incredibly slow. Enabling these checks results in
better test coverage since there are bots doing unoptimized builds, and
the cost is relatively small in the context of an unoptimized build.
This also allows using `JEMALLOC_FLAGS` to override the default
configure flags.
If you browse to, say, http://doc.rust-lang.org/libc/types/os/common/posix01/struct.timeval.html , you will see the "location" window showing `libc::types::os::common::posix01`. The first element points to a crate and others point modules. This patch adds the bold attribute to the first (ie. crate) element so that it stands out more.
This branch adds support for running LLVM optimization and codegen on different parts of a crate in parallel. Instead of translating the crate into a single LLVM compilation unit, `rustc` now distributes items in the crate among several compilation units, and spawns worker threads to optimize and codegen each compilation unit independently. This improves compile times on multicore machines, at the cost of worse performance in the compiled code. The intent is to speed up build times during development without sacrificing too much optimization.
On the machine I tested this on, `librustc` build time with `-O` went from 265 seconds (master branch, single-threaded) to 115s (this branch, with 4 threads), a speedup of 2.3x. For comparison, the build time without `-O` was 90s (single-threaded). Bootstrapping `rustc` using 4 threads gets a 1.6x speedup over the default settings (870s vs. 1380s), and building `librustc` with the resulting stage2 compiler takes 1.3x as long as the master branch (44s vs. 55s, single threaded, ignoring time spent in LLVM codegen).
The user-visible changes from this branch are two new codegen flags:
* `-C codegen-units=N`: Distribute items across `N` compilation units.
* `-C codegen-threads=N`: Spawn `N` worker threads for running optimization and codegen. (It is possible to set `codegen-threads` larger than `codegen-units`, but this is not very useful.)
Internal changes to the compiler are described in detail on the individual commit messages.
Note: The first commit on this branch is copied from #16359, which this branch depends on.
r? @nick29581
Adjust the handling of `#[inline]` items so that they get translated into every
compilation unit that uses them. This is necessary to preserve the semantics
of `#[inline(always)]`.
Crate-local `#[inline]` functions and statics are blindly translated into every
compilation unit. Cross-crate inlined items and monomorphizations of
`#[inline]` functions are translated the first time a reference is seen in each
compilation unit. When using multiple compilation units, inlined items are
given `available_externally` linkage whenever possible to avoid duplicating
object code.
Add a post-processing pass to `trans` that converts symbols from external to
internal when possible. Translation with multiple compilation units initially
makes most symbols external, since it is not clear when translating a
definition whether that symbol will need to be accessed from another
compilation unit. This final pass internalizes symbols that are not reachable
from other crates and not referenced from other compilation units, so that LLVM
can perform more aggressive optimizations on those symbols.
Use a shared lookup table of previously-translated monomorphizations/glue
functions to avoid translating those functions in every compilation unit where
they're used. Instead, the function will be translated in whichever
compilation unit uses it first, and the remaining compilation units will link
against that original definition.
Rotate between compilation units while translating. The "worker threads"
commit added support for multiple compilation units, but only translated into
one, leaving the rest empty. With this commit, `trans` rotates between various
compilation units while translating, using a simple stragtegy: upon entering a
module, switch to translating into whichever compilation unit currently
contains the fewest LLVM instructions.
Most of the actual changes here involve getting symbol linkage right, so that
items translated into different compilation units will link together properly
at the end.
When inlining an item from another crate, use the original symbol from that
crate's metadata instead of generating a new symbol using the `ast::NodeId` of
the inlined copy. This requires exporting symbols in the crate metadata in a
few additional cases. Having predictable symbols for inlined items will be
useful later to avoid generating duplicate object code for inlined items.
Refactor the code in `llvm::back` that invokes LLVM optimization and codegen
passes so that it can be called from worker threads. (Previously, it used
`&Session` extensively, and `Session` is not `Share`.) The new code can handle
multiple compilation units, by compiling each unit to `crate.0.o`, `crate.1.o`,
etc., and linking together all the `crate.N.o` files into a single `crate.o`
using `ld -r`. The later linking steps can then be run unchanged.
The new code preserves the behavior of `--emit`/`-o` when building a single
compilation unit. With multiple compilation units, the `--emit=asm/ir/bc`
options produce multiple files, so combinations like `--emit=ir -o foo.ll` will
not actually produce `foo.ll` (they instead produce several `foo.N.ll` files).
The new code supports `-Z lto` only when using a single compilation unit.
Compiling with multiple compilation units and `-Z lto` will produce an error.
(I can't think of any good reason to do such a thing.) Linking with `-Z lto`
against a library that was built as multiple compilation units will also fail,
because the rlib does not contain a `crate.bytecode.deflate` file. This could
be supported in the future by linking together the `crate.N.bc` files produced
when compiling the library into a single `crate.bc`, or by making the LTO code
support multiple `crate.N.bytecode.deflate` files.
Break up `CrateContext` into `SharedCrateContext` and `LocalCrateContext`. The
local piece corresponds to a single compilation unit, and contains all
LLVM-related components. (LLVM data structures are tied to a specific
`LLVMContext`, and we will need separate `LLVMContext`s to safely run
multithreaded optimization.) The shared piece contains data structures that
need to be shared across all compilation units, such as the `ty::ctxt` and some
tables related to crate metadata.