This commit adds compiler support for two basic operations needed for binding
SIMD on x86 platforms:
* First, a `nontemporal_store` intrinsic was added for `_mm_stream_ps`, seen
in rust-lang-nursery/stdsimd#114. This was relatively straightforward and is
quite similar to the volatile store intrinsic.
* Next, and much more intrusively, a new type was added to the backend. The
`x86_mmx` type is used in LLVM for a 64-bit vector register and is used in
various intrinsics like `_mm_abs_pi8` as seen in rust-lang-nursery/stdsimd#74.
This new type was added as a new layout option as well as having support added
to the trans backend. The type is enabled with the `#[repr(x86_mmx)]`
attribute which is intended to just be an implementation detail of SIMD in
Rust.
I'm not 100% certain about how the `x86_mmx` type was added, so any extra eyes
or thoughts on that would be greatly appreciated!
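For context, here is a minimal sketch of what a non-temporal store looks like from user code. It assumes the `std::arch` API that was stabilized later (not the compiler intrinsic added in this commit); `Aligned16` and `stream_store` are made-up names for illustration, and the non-x86_64 fallback is just a plain copy.

```rust
/// Hypothetical wrapper type: `_mm_stream_ps` needs a 16-byte-aligned destination.
#[repr(align(16))]
pub struct Aligned16(pub [f32; 4]);

/// Writes `src` into `dst` with a non-temporal (cache-bypassing) store.
#[cfg(target_arch = "x86_64")]
pub fn stream_store(dst: &mut Aligned16, src: [f32; 4]) {
    use std::arch::x86_64::{_mm_loadu_ps, _mm_sfence, _mm_stream_ps};
    // SSE is part of the x86_64 baseline, so no runtime feature detection is needed.
    unsafe {
        // Unaligned load of the four source floats into a vector register.
        let v = _mm_loadu_ps(src.as_ptr());
        // Non-temporal store; the destination is aligned thanks to `Aligned16`.
        _mm_stream_ps(dst.0.as_mut_ptr(), v);
        // Fence so the streamed data is ordered before any later memory accesses.
        _mm_sfence();
    }
}

/// Fallback for other architectures: an ordinary store.
#[cfg(not(target_arch = "x86_64"))]
pub fn stream_store(dst: &mut Aligned16, src: [f32; 4]) {
    dst.0 = src;
}
```

The fence is there because streaming stores are weakly ordered; leaving it out is the easy mistake to make with this intrinsic.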
std: Add a new wasm32-unknown-unknown target
This commit adds a new target to the compiler: wasm32-unknown-unknown. This target is a reimagining of what it looks like to generate WebAssembly code from Rust. Instead of using Emscripten, which can bring a weighty runtime with it, this target uses only the LLVM backend for WebAssembly and, for now, a "custom linker" which will hopefully one day be replaced by direct calls to lld.
Notable features of this target include:
* There is zero runtime footprint. The target assumes nothing exists other than the wasm32 instruction set.
* There is zero toolchain footprint beyond adding the target. No custom linker is needed; rustc contains everything.
* Very small wasm modules can be generated directly from Rust code using this target.
* Most of the standard library is stubbed out to return an error, but anything related to allocation works (e.g. `HashMap`, `Vec`, etc.).
* Naturally, any `#[no_std]` crate should be 100% compatible with this new target.
This target is currently somewhat janky due to how linking works. The "linking" is currently unconditional whole program LTO (i.e. LLVM is being used as a linker). Naturally that means compiling programs is pretty slow! Eventually though this target should have a real linker.
This target is also intended to be quite experimental. I'm hoping that this can act as a catalyst for further experimentation in Rust with WebAssembly. Breaking changes are very likely to land to this target, so it's not recommended to rely on it in any critical capacity yet. We'll let you know when it's "production ready".
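As a rough illustration of the kind of code this target is aimed at, here is a minimal sketch of a function exported with an unmangled name. The function name is made up, and the `#[unsafe(no_mangle)]` spelling assumes a toolchain from Rust 1.82 or later (older toolchains spell it `#[no_mangle]`). It only touches allocation (`Vec`), which is one of the parts of the standard library described above as working.

```rust
/// Hypothetical export: sums the squares 1^2 + 2^2 + ... + n^2.
/// With an unmangled name and the C ABI it can be called from the host
/// environment once the crate is compiled to a wasm module.
#[unsafe(no_mangle)]
pub extern "C" fn sum_of_squares(n: u32) -> u64 {
    // Allocation is expected to work on this target, so `Vec` is fine to use.
    let squares: Vec<u64> = (1..=u64::from(n)).map(|x| x * x).collect();
    squares.iter().sum()
}
```

Nothing here depends on a runtime, so the same function can be checked natively before trying it on the wasm target.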
### Building yourself
First you'll need to configure the build of LLVM and enable this target:
```
$ ./configure --target=wasm32-unknown-unknown --set llvm.experimental-targets=WebAssembly
```
Next you'll want to remove any previously compiled LLVM as it needs to be rebuilt with WebAssembly support. You can do that with:
```
$ rm -rf build
```
And then you're good to go! A `./x.py build` should give you a rustc with the appropriate libstd target.
### Test support
Currently testing-wise this target is looking pretty good but isn't complete. I've got almost the entire `run-pass` test suite working with this target (lots of tests ignored, but many passing as well). The `core` test suite is [still getting LLVM bugs fixed](https://reviews.llvm.org/D39866) to get that working and will take some time. Relatively simple programs all seem to work though!
In general I've only tested this with a local fork that makes use of LLVM 5 rather than our current LLVM 4 on master. The LLVM 4 WebAssembly backend AFAIK isn't broken per se but is likely missing bug fixes available on LLVM 5. I'm hoping though that we can decouple the LLVM 5 upgrade and adding this wasm target!
### But the modules generated are huge!
It's worth noting that you may not immediately see the "smallest possible wasm module" for the input you feed to rustc. For various reasons it's very difficult to get rid of the final "bloat" in vanilla rustc (again, a real linker should fix all this). For now what you'll have to do is:

    cargo install --git https://github.com/alexcrichton/wasm-gc
    wasm-gc foo.wasm bar.wasm

And then `bar.wasm` should be the smallest we can get it!
---
In any case for now I'd love feedback on this, particularly on the various integration points if you've got better ideas of how to approach them!
This commit is an implementation of LLVM's ThinLTO for consumption in rustc
itself. Today LTO works by merging all relevant LLVM modules into one
and then running optimization passes. "Thin" LTO operates differently by having
more sharded work and allowing parallelism opportunities between optimizing
codegen units. Further down the road Thin LTO also allows *incremental* LTO
which should enable even faster release builds without compromising on the
performance we have today.
This commit uses a `-Z thinlto` flag to gate whether ThinLTO is enabled. It then
also implements two forms of ThinLTO:
* In one mode we'll *only* perform ThinLTO over the codegen units produced in a
single compilation. That is, we won't load upstream rlibs, but we'll instead
just perform ThinLTO amongst all codegen units produced by the compiler for
the local crate. This is intended to emulate a desired end point where we have
codegen units turned on by default for all crates and ThinLTO allows us to do
this without performance loss.
* In another mode, like full LTO today, we'll optimize all upstream dependencies
in "thin" mode. Unlike today, however, this LTO step is fully parallelized so
should finish much more quickly.
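To make the difference in shape concrete, here is a toy model (not the actual implementation). `optimize` is a made-up stand-in for running LLVM passes, modules are just strings, and the `jobs` cap only loosely imitates what the jobserver does. It deliberately ignores the cross-module summaries and importing that make real ThinLTO effective.

```rust
use std::thread;

/// Stand-in for running LLVM optimization passes over one module (hypothetical).
fn optimize(module: &str) -> String {
    format!("opt({module})")
}

/// "Fat" LTO: merge every module into one, then optimize that single module serially.
pub fn full_lto(modules: &[String]) -> Vec<String> {
    vec![optimize(&modules.join("+"))]
}

/// "Thin" LTO shape: keep modules separate and optimize them in parallel,
/// using at most `jobs` worker threads.
pub fn thin_lto(modules: &[String], jobs: usize) -> Vec<String> {
    let jobs = jobs.max(1);
    // Split the modules into at most `jobs` contiguous shards.
    let chunk_len = ((modules.len() + jobs - 1) / jobs).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = modules
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|m| optimize(m)).collect::<Vec<String>>())
            })
            .collect();
        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect::<Vec<String>>()
    })
}
```

The point is only that the thin variant produces one optimized module per input and can spread them over threads, while the fat variant funnels everything through a single module.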
There are a good number of comments about what the implementation is doing and
where it came from, but the tl;dr is that most of the support here is currently
copied from upstream LLVM. This code duplication is done for a number of
reasons:
* Controlling parallelism means we can use the existing jobserver support to
avoid overloading machines.
* We will likely want a slightly different form of incremental caching which
integrates with our own incremental strategy, but this is yet to be
determined.
* This buys us some flexibility about when/where we run ThinLTO, as well as
having it tailored to fit our needs for the time being.
* Finally this allows us to reuse some artifacts such as our `TargetMachine`
creation, where all the options we use today aren't necessarily supported by
upstream LLVM yet.
My hope is that we can get some experience with this copy/paste in tree and then
eventually upstream some work to LLVM itself to avoid the duplication while
still ensuring our needs are met. Otherwise I fear that maintaining these
bindings may be quite costly over the years with LLVM updates!
This commit is a refactoring of the LTO backend in Rust to support compilations
with multiple codegen units. The immediate result of this PR is to remove the
artificial error emitted by rustc about `-C lto -C codegen-units=8`, but longer
term this is intended to lay the groundwork for LTO with incremental compilation
and ultimately be the underpinning of ThinLTO support.
The problem here that needed solving is that when rustc is producing multiple
codegen units in one compilation LTO needs to merge them all together.
Previously only upstream dependencies were merged and it was inherently relied
on that there was only one local codegen unit. Supporting this involved
refactoring the optimization backend architecture for rustc, namely splitting
the `optimize_and_codegen` function into `optimize` and `codegen`. After an LLVM
module has been optimized it may be blocked and queued up for LTO, and only
after LTO are modules code generated.
Non-LTO compilations should look the same as they do today backend-wise: we'll
spin up a thread for each codegen unit and optimize/codegen in that thread. LTO
compilations will, however, send the LLVM module back to the coordinator thread
once optimizations have finished. When all LLVM modules have finished optimizing
the coordinator will invoke the LTO backend, producing a further list of LLVM
modules. Currently this is always a list of one LLVM module. The coordinator
then spawns further work to run LTO and code generation passes over each module.
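The flow described above can be sketched roughly as follows. This is a toy model under loud assumptions: modules are strings, `optimize`/`codegen` are made-up stand-ins, the "LTO backend" is a string join, and the post-LTO work runs inline on the coordinator rather than being spawned again.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-ins for the real optimization and code generation passes (hypothetical).
fn optimize(unit: &str) -> String {
    format!("opt({unit})")
}
fn codegen(module: &str) -> String {
    format!("obj({module})")
}

/// Compiles `units`, one worker thread per codegen unit.
pub fn compile(units: Vec<String>, lto: bool) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<(usize, String)>();
    let workers: Vec<_> = units
        .into_iter()
        .enumerate()
        .map(|(idx, unit)| {
            let tx = tx.clone();
            thread::spawn(move || {
                let optimized = optimize(&unit);
                // Non-LTO: codegen right here in the worker.
                // LTO: send the optimized module back to the coordinator instead.
                let result = if lto { optimized } else { codegen(&optimized) };
                tx.send((idx, result)).expect("coordinator hung up");
            })
        })
        .collect();
    drop(tx); // the receive loop ends once every worker has dropped its sender

    // Coordinator: wait until all modules have finished optimizing.
    let mut finished: Vec<(usize, String)> = rx.iter().collect();
    for worker in workers {
        worker.join().expect("worker panicked");
    }
    finished.sort(); // restore the original unit order
    let modules: Vec<String> = finished.into_iter().map(|(_, m)| m).collect();
    if !lto {
        return modules;
    }

    // "LTO backend": currently always produces a list of exactly one module.
    let merged = vec![format!("lto({})", modules.join("+"))];
    merged.iter().map(|m| codegen(m)).collect()
}
```

Only after the merge step does any module get code generated in the LTO case, which is the reason `optimize_and_codegen` had to be split in two.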
In the course of this refactoring a number of other pieces were refactored:
* Management of the bytecode encoding in rlibs was centralized into one module
instead of being scattered across LTO and linking.
* Some internal refactorings on the link stage of the compiler were done to work
directly from `CompiledModule` structures instead of lists of paths.
* The trans time-graph output was tweaked a little to include a name on each
bar and inflate the size of the bars a little.
This PR is an implementation of [RFC 1974] which specifies a new method of
defining a global allocator for a program. This obsoletes the old
`#![allocator]` attribute and also removes support for it.
[RFC 1974]: https://github.com/rust-lang/rfcs/pull/197
The new `#[global_allocator]` attribute solves many issues encountered with the
`#![allocator]` attribute such as composition and restrictions on the crate
graph itself. The compiler now has much more control over the ABI of the
allocator and how it's implemented, allowing much more freedom in terms of how
this feature is implemented.
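For a feel of how the attribute is used, here is a minimal sketch. Note the assumption: it is written against the `std::alloc::GlobalAlloc` trait as it was later stabilized, which is not the exact trait shape in this PR. `Counting`, `ALLOCATIONS` and `allocation_count` are made-up names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator that forwards to the system allocator and counts calls.
pub struct Counting;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate the actual work to the platform's default allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Any crate in the graph may declare the global allocator, but only one may do so.
#[global_allocator]
static GLOBAL: Counting = Counting;

/// Number of allocations observed so far.
pub fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}
```

Because the allocator is an ordinary `static` of a type implementing a trait, wrappers like this one compose, which was awkward with the old crate-level `#![allocator]` attribute.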
cc #27389
This support is needed for bindgen to work well on 32-bit Windows, and
also enables people to begin experimenting with C++ FFI support on that
platform.
Fixes #42044.
When -Z profile is passed, the GCDAProfiling LLVM pass is added
to the pipeline, which uses debug information to instrument the IR.
After compiling with -Z profile, the $(OUT_DIR)/$(CRATE_NAME).gcno
file is created, containing initial profiling information.
After running the program built, the $(OUT_DIR)/$(CRATE_NAME).gcda
file is created, containing branch counters.
The created *.gcno and *.gcda files can be processed using
the "llvm-cov gcov" and "lcov" tools. The profiling data LLVM
generates does not faithfully follow the GCC's format for *.gcno
and *.gcda files, and so it will probably not work with other tools
(such as gcov itself) that consume these files.
Replaces the llvm-c exposed LLVMRelocMode, which does not include all
relocation model variants, with a LLVMRustRelocMode modeled after
LLVMRustCodeMode.
Tracking issue: https://github.com/rust-lang/rust/issues/40180
This calling convention can be used for defining interrupt handlers on
32-bit and 64-bit x86 targets. The compiler then uses `iret` instead of
`ret` for returning and ensures that all registers are restored to their
original values.
Usage:
```
extern "x86-interrupt" fn handler(stack_frame: &ExceptionStackFrame) {…}
```
for interrupts and exceptions without error code and
```
extern "x86-interrupt" fn page_fault_handler(stack_frame: &ExceptionStackFrame,
error_code: u64) {…}
```
for exceptions that push an error code (e.g., page faults or general
protection faults). The programmer must ensure that the correct version
is used for each interrupt.
For more details see the [LLVM PR][1] and the corresponding [proposal][2].
[1]: https://reviews.llvm.org/D15567
[2]: http://lists.llvm.org/pipermail/cfe-dev/2015-September/045171.html
[MIR] SwitchInt Everywhere
Something I've been meaning to do for a very long while. This PR essentially gets rid of 3 kinds of conditional branching and keeps only the most general one, `SwitchInt`. The primary benefit is that dealing with MIR no longer involves handling 3 different ways to express conditional control flow. On the other hand, constructing a `SwitchInt` currently requires more code than was previously necessary to build an equivalent `If` terminator. That is trivially "fixable" with some constructor methods somewhere (MIR badly needs things like that in general).
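To illustrate how an `If` collapses into the general form, here is a toy sketch. The types are made up for illustration and are not the compiler's actual MIR definitions; the `if_` constructor is the kind of helper alluded to above.

```rust
/// Hypothetical stand-in for a MIR basic block index.
pub type BasicBlock = usize;

/// Toy switch terminator: `values[i]` jumps to `targets[i]`, and the one extra
/// trailing entry in `targets` is the fallback ("otherwise") block.
#[derive(Debug, Clone, PartialEq)]
pub struct SwitchInt {
    pub values: Vec<u128>,
    pub targets: Vec<BasicBlock>,
}

impl SwitchInt {
    /// Builds the equivalent of `if cond { then_block } else { else_block }`:
    /// a discriminant of 0 (false) goes to `else_block`, anything else to `then_block`.
    pub fn if_(then_block: BasicBlock, else_block: BasicBlock) -> SwitchInt {
        SwitchInt {
            values: vec![0],
            targets: vec![else_block, then_block],
        }
    }

    /// Returns the block control flow continues in for a given discriminant.
    pub fn successor(&self, discr: u128) -> BasicBlock {
        match self.values.iter().position(|&v| v == discr) {
            Some(i) => self.targets[i],
            None => *self.targets.last().expect("switch needs a fallback target"),
        }
    }
}
```

With a helper like this, building a two-way branch is a one-liner again, while every consumer of the terminator only has to understand the one general shape.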
Some timings (tl;dr: slightly faster^1 (unexpected), but also uses slightly more memory at peak (expected)):
^1: Not sure if the speed benefits are because of LLVM liking the generated code better or the compiler itself getting compiled better. Either way, it's a net benefit. The CORE and SYNTAX timings were done for compilation without optimisation.
```
AFTER:
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 31.50 secs
Finished release [optimized] target(s) in 31.42 secs
Building stage1 compiler artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 439.56 secs
Finished release [optimized] target(s) in 435.15 secs
CORE: 99% (24.81 real, 0.13 kernel, 24.57 user); 358536k resident
CORE: 99% (24.56 real, 0.15 kernel, 24.36 user); 359168k resident
SYNTAX: 99% (49.98 real, 0.48 kernel, 49.42 user); 653416k resident
SYNTAX: 99% (50.07 real, 0.58 kernel, 49.43 user); 653604k resident
BEFORE:
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 31.84 secs
Building stage1 compiler artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 451.17 secs
CORE: 99% (24.66 real, 0.20 kernel, 24.38 user); 351096k resident
CORE: 99% (24.36 real, 0.17 kernel, 24.18 user); 352284k resident
SYNTAX: 99% (52.24 real, 0.56 kernel, 51.66 user); 645544k resident
SYNTAX: 99% (51.55 real, 0.48 kernel, 50.99 user); 646428k resident
```
cc @nikomatsakis @eddyb
Emit DW_AT_main_subprogram
This changes rustc to emit DW_AT_main_subprogram on the "main" program.
This lets gdb suitably stop at the user's main in response to
"start" (rather than the library's main, which is what happens
currently).
Fixes #32620
r? michaelwoerister
Instead of directly creating a 'DIGlobalVariable', we now have to create
a 'DIGlobalVariableExpression' which itself contains a reference to a
'DIGlobalVariable'.
This is a straightforward change.
In the future, we should rename 'DIGlobalVariable' in the FFI
bindings, assuming we will only refer to 'DIGlobalVariableExpression'
and not 'DIGlobalVariable'.
The LLVM Core C bindings provide this function for all versions back to the oldest one we support (3.7), and
it helps to avoid this unnecessary builder->function transition every time. Also a negative diff.
Since discriminants do not support i128 yet, let's just calculate the boundaries within the 64 bits
that are supported. This also avoids an issue with bootstrapping on 32 bit systems due to #38727.
Fixes rebase fallout, makes code correct in presence of 128-bit constants.
This commit includes manual merge conflict resolution changes from a rebase by @est31.
This commit introduces 128-bit integers. Stage 2 builds and produces a working compiler which
understands and supports 128-bit integers throughout.
The general strategy used is to have a rustc_i128 module which provides aliases for iu128, equal to
iu64 in stage0 and iu128 later. Since nowhere in rustc do we rely on large numbers being supported,
this strategy is good enough to get past the first bootstrap stages and end up with a fully working
128-bit capable compiler.
In order for this strategy to work, a number of locations had to be changed to use the associated
max_value/min_value functions instead of the MAX/MIN constants, and min_value (or was it max_value?)
had to be changed to use xor instead of shift so that both 64-bit and 128-bit based consteval work
(the former not necessarily producing the right results in stage1).
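As an illustration of the xor idea, here is a small sketch. The exact expression used in the commit is not quoted here, so treat this formulation and the function names as assumptions. It avoids shifting by a width-dependent amount, so the same shape works for any integer width. The last function is just a typical use of the new 128-bit type.

```rust
/// Signed minimum for 64 bits: all ones xor "all ones except the sign bit".
pub fn i64_min_via_xor() -> i64 {
    !0 ^ ((!0u64) >> 1) as i64
}

/// The same expression shape works unchanged for the 128-bit types.
pub fn i128_min_via_xor() -> i128 {
    !0 ^ ((!0u128) >> 1) as i128
}

/// Full 64x64 -> 128 bit multiplication, returned as (high, low) halves.
pub fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128);
    ((wide >> 64) as u64, wide as u64)
}
```

For example, `u64::MAX * u64::MAX` is `2^128 - 2^65 + 1`, so the high half is `u64::MAX - 1` and the low half is `1`.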
This commit includes manual merge conflict resolution changes from a rebase by @est31.
PTX support, take 2
- You can generate PTX using `--emit=asm` and the right (custom) target, which
you can then run on an NVIDIA GPU.
- You can compile `core` to PTX. [Xargo] also works and it can compile some
other crates like `collections` (but I doubt all of those make sense on a GPU)
[Xargo]: https://github.com/japaric/xargo
- You can create "global" functions, which can be "called" by the host, using
the `"ptx-kernel"` ABI, e.g. `extern "ptx-kernel" fn kernel() { .. }`. Every
other function is a "device" function and can only be called by the GPU.
- Intrinsics like `__syncthreads()` and `blockIdx.x` are available as
`"platform-intrinsic"`s. These intrinsics are *not* in the `core` crate but
any Rust user can create "bindings" to them using an `extern
"platform-intrinsic"` block. See example at the end.
- Debuginfo. Trying to emit PTX with `-g` gives you an LLVM error. But I don't
think PTX can contain debuginfo anyway so `-g` should be ignored and a warning
should be printed ("`-g` doesn't work with this target" or something).
- "Single source" support. You *can't* write a single source file that contains
both host and device code. I think it should be possible to implement that
outside the compiler using compiler plugins / build scripts.
- The equivalent to CUDA `__shared__`, which is used to declare memory that's
shared between the threads of the same block. This could be implemented using
attributes: `#[shared] static mut SCRATCH_MEMORY: [f32; 64]` but hasn't been
implemented yet.
- Built-in targets. This PR doesn't add targets to the compiler just yet but one
can create custom targets to be able to emit PTX code (see the example at the
end). The idea is to have people experiment with this feature before
committing to it (built-in targets are "insta-stable")
- All functions must be "inlined". IOW, the `.rlib` must always contain the LLVM
bitcode of all the functions of the crate it was produced from. Otherwise, you
end up with "undefined references" in the final PTX code but you won't get *any*
linker error because no linker is involved. IOW, you'll hit a runtime error
when loading the PTX into the GPU. The workaround is to use `#[inline]` on
non-generic functions and to never use `#[inline(never)]` but this may not
always be possible because e.g. you could be relying on third party code.
- Should `--emit=asm` generate a `.ptx` file instead of a `.s` file?
TL;DR Use Xargo to turn a crate into a PTX module (a `.s` file). Then pass that
PTX module, as a string, to the GPU and run it.
The full code is in [this repository]. This section gives an overview of how to
run Rust code on an NVIDIA GPU.
[this repository]: https://github.com/japaric/cuda
- Create a custom target. Here's the 64-bit NVPTX target (NOTE: the comments
are not valid because this is supposed to be a JSON file; remove them before
you use this file):
``` js
// nvptx64-nvidia-cuda.json
{
  "arch": "nvptx64",  // matches LLVM
  "cpu": "sm_20",  // "oldest" compute capability supported by LLVM
  "data-layout": "e-i64:64-v16:16-v32:32-n16:32:64",
  "llvm-target": "nvptx64-nvidia-cuda",
  "max-atomic-width": 0,  // LLVM errors with any other value :-(
  "os": "cuda",  // matches LLVM
  "panic-strategy": "abort",
  "target-endian": "little",
  "target-pointer-width": "64",
  "target-vendor": "nvidia"  // matches LLVM -- not required
}
```
(There's a 32-bit target specification in the linked repository)
- Write a kernel
``` rust
extern "platform-intrinsic" {
    fn nvptx_block_dim_x() -> i32;
    fn nvptx_block_idx_x() -> i32;
    fn nvptx_thread_idx_x() -> i32;
}

/// Copies an array of `n` floating point numbers from `src` to `dst`
pub unsafe extern "ptx-kernel" fn memcpy(dst: *mut f32,
                                         src: *const f32,
                                         n: usize) {
    let i = (nvptx_block_dim_x() as isize)
        .wrapping_mul(nvptx_block_idx_x() as isize)
        .wrapping_add(nvptx_thread_idx_x() as isize);
    if (i as usize) < n {
        *dst.offset(i) = *src.offset(i);
    }
}
```
- Emit PTX code
```
$ xargo rustc --target nvptx64-nvidia-cuda --release -- --emit=asm
Compiling core v0.0.0 (file://..)
(..)
Compiling nvptx-builtins v0.1.0 (https://github.com/japaric/nvptx-builtins)
Compiling kernel v0.1.0
$ cat target/nvptx64-nvidia-cuda/release/deps/kernel-*.s
//
// Generated by LLVM NVPTX Back-End
//
.version 3.2
.target sm_20
.address_size 64
// .globl memcpy
.visible .entry memcpy(
.param .u64 memcpy_param_0,
.param .u64 memcpy_param_1,
.param .u64 memcpy_param_2
)
{
.reg .pred %p<2>;
.reg .s32 %r<5>;
.reg .s64 %rd<12>;
ld.param.u64 %rd7, [memcpy_param_2];
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mul.wide.s32 %rd8, %r2, %r1;
mov.u32 %r3, %tid.x;
cvt.s64.s32 %rd9, %r3;
add.s64 %rd10, %rd9, %rd8;
setp.ge.u64 %p1, %rd10, %rd7;
@%p1 bra LBB0_2;
ld.param.u64 %rd3, [memcpy_param_0];
ld.param.u64 %rd4, [memcpy_param_1];
cvta.to.global.u64 %rd5, %rd4;
cvta.to.global.u64 %rd6, %rd3;
shl.b64 %rd11, %rd10, 2;
add.s64 %rd1, %rd6, %rd11;
add.s64 %rd2, %rd5, %rd11;
ld.global.u32 %r4, [%rd2];
st.global.u32 [%rd1], %r4;
LBB0_2:
ret;
}
```
- Run it on the GPU
``` rust
// `kernel.ptx` is the `*.s` file we got in the previous step
const KERNEL: &'static str = include_str!("kernel.ptx");
driver::initialize()?;
let device = Device(0)?;
let ctx = device.create_context()?;
let module = ctx.load_module(KERNEL)?;
let kernel = module.function("memcpy")?;
let h_a: Vec<f32> = /* create some random data */;
let h_b = vec![0.; N];
let d_a = driver::allocate(bytes)?;
let d_b = driver::allocate(bytes)?;
// Copy from host to GPU
driver::copy(h_a, d_a)?;
// Run `memcpy` on the GPU
kernel.launch(d_b, d_a, N)?;
// Copy from GPU to host
driver::copy(d_b, h_b)?;
// Verify
assert_eq!(h_a, h_b);
// `d_a`, `d_b`, `h_a`, `h_b` are dropped/freed here
```
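If you don't have a GPU at hand, the indexing scheme used by the `memcpy` kernel above can be sanity-checked with a plain CPU emulation. This is only a sketch: `emulate_memcpy`, `grid_dim` and `block_dim` are made-up names, the loops run sequentially instead of in parallel, and `n` is clamped to the slice lengths so the emulation stays in bounds (the real kernel does no such check).

```rust
/// Emulates launching the `memcpy` kernel on a 1D grid of `grid_dim` blocks
/// with `block_dim` threads each, one loop iteration per GPU thread.
pub fn emulate_memcpy(dst: &mut [f32], src: &[f32], n: usize, grid_dim: usize, block_dim: usize) {
    // Keep the emulation memory safe even if `n` is larger than the buffers.
    let n = n.min(dst.len()).min(src.len());
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // Same global index computation as in the kernel:
            // blockDim.x * blockIdx.x + threadIdx.x
            let i = block_dim.wrapping_mul(block_idx).wrapping_add(thread_idx);
            // Threads past the end of the data do nothing.
            if i < n {
                dst[i] = src[i];
            }
        }
    }
}
```

This also shows why the launch configuration matters: if `grid_dim * block_dim` is smaller than `n`, the tail of the array is never copied.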
---
cc @alexcrichton @brson @rkruppe
> What has changed since #34195?
- `core` can now be compiled into PTX, which makes it very easy to turn `no_std`
crates into "kernels" with the help of Xargo.
- There's now a way, the `"ptx-kernel"` ABI, to generate "global" functions. The
old PR required a manual step (it was a hack) to "convert" "device" functions
into "global" functions. (Only "global" functions can be launched by the host)
- Everything is unstable. There are no "insta stable" built-in targets this
time (\*). The users have to use a custom target to experiment with this
feature. Also, PTX intrinsics, like `__syncthreads` and `blockIdx.x`, are now
implemented as `"platform-intrinsic"`s so they no longer live in the `core`
crate.
(\*) I'd actually like to have in-tree targets because that makes this target
more discoverable, removes the need to lug around .json files, etc.
However, bundling a target with the compiler immediately puts it on the path
towards stabilization, which gives us just two cycles to find and fix any
problem with the target specification. Afterwards, it becomes hard to tweak
the specification because that could be a breaking change.
A possible solution could be "unstable built-in targets". Basically, to use an
unstable target, you'll have to also pass `-Z unstable-options` to the compiler.
And unstable targets, being unstable, wouldn't be available on stable.
> Why should this be merged?
- To let people experiment with the feature out of tree. Having easy access to
the feature (in every nightly) allows this. I also think that, as it is, it
should be possible to start prototyping type-safe single source support using
build scripts, macros and/or plugins.
- It's a straightforward implementation. No different than adding support for
any other architecture.
- `--emit=asm --target=nvptx64-nvidia-cuda` can be used to turn a crate
into a PTX module (a `.s` file).
- intrinsics like `__syncthreads` and `blockIdx.x` are exposed as
`"platform-intrinsic"`s.
- "cabi" has been implemented for the nvptx and nvptx64 architectures.
i.e. `extern "C"` works.
- a new ABI, `"ptx-kernel"`, that can be used to generate "global"
functions. Example: `extern "ptx-kernel" fn kernel() { .. }`. All
other functions are "device" functions.
Alignment was removed from createBasicType and moved to
- createGlobalVariable
- createAutoVariable
- createStaticMemberType (unused in Rust)
- createTempGlobalVariableFwdDecl (unused in Rust)
e69c459a6e