PR libgomp/90585
* plugin/plugin-hsa.c: Include gstdint.h. Include inttypes.h only if
HAVE_INTTYPES_H is defined.
(print_uint64_t): New typedef.
(PRIu64): Define if HAVE_INTTYPES_H is not defined.
(print_kernel_dispatch, run_kernel): Use PRIu64 macro instead of
"lu", cast uint64_t HSA_DEBUG and fprintf arguments to print_uint64_t.
(release_kernel_dispatch): Likewise. Cast shadow->debug to uintptr_t
before casting to void *.
* plugin/plugin-nvptx.c: Include gstdint.h instead of stdint.h.
* oacc-mem.c: Don't include config.h nor stdint.h.
* target.c: Don't include config.h.
* oacc-cuda.c: Likewise.
* oacc-host.c: Don't include stdint.h.
From-SVN: r271597
I wrote a test-case:
...
int
main (void)
{
for (unsigned i = 0; i < 128; ++i)
{
acc_init (acc_device_nvidia);
acc_shutdown (acc_device_nvidia);
}
return 0;
}
...
and ran it under valgrind. The only leak location reported with a frequency
of 128, was the allocation of ptx_devices in nvptx_init.
Fix this by freeing ptx_devices in GOMP_OFFLOAD_fini_device, once
instantiated_devices drops to 0.
2019-01-24 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_fini_device): Free ptx_devices
once instantiated_devices drops to 0.
From-SVN: r268237
Consider test-case:
...
int
main (void)
{
#pragma acc parallel async
;
#pragma acc parallel async
;
#pragma acc wait
return 0;
}
...
This fails with:
...
libgomp: cuMemAlloc error: invalid argument
Segmentation fault (core dumped)
...
The cuMemAlloc error is due to the fact that we're try to allocate 0 bytes.
Fix this by preventing calling map_push with size zero argument in nvptx_exec.
This also has the consequence that for the abort-1.c test-case, we end up
calling cuMemFree during map_fini for the struct cuda_map allocated in
map_init, which fails because an abort happened. Fix this by calling
cuMemFree with CUDA_CALL_NOCHECK in cuda_map_destroy.
2019-01-23 Tom de Vries <tdevries@suse.de>
PR target/PR88946
* plugin/plugin-nvptx.c (cuda_map_destroy): Use CUDA_CALL_NOCHECK for
cuMemFree.
(nvptx_exec): Don't call map_push if mapnum == 0.
* testsuite/libgomp.oacc-c-c++-common/pr88946.c: New test.
From-SVN: r268178
There are currently two situations where this assert triggers:
...
libgomp/plugin/plugin-nvptx.c: map_fini: Assertion `!s->map->active' failed.
...
First, in abort-1.c, a parallel region triggering an abort:
...
int
main (void)
{
#pragma acc parallel
abort ();
return 0;
}
...
The abort is detected in nvptx_exec as the CUDA_ERROR_ILLEGAL_INSTRUCTION
return status of the cuStreamSynchronize call after kernel launch, which is
then handled by calling non-returning function GOMP_PLUGIN_fatal.
Consequently, the map_pop in nvptx_exec that in case of cuStreamSynchronize
success would remove or inactive the element added by the map_push earlier in
nvptx_exec, does not trigger. With the element no longer active, but still
marked active and a member of s->map, we run into the assert during
GOMP_OFFLOAD_fini_device, which is triggered from atexit handler
gomp_target_fini (which is triggered by the GOMP_PLUGIN_fatal mentioned above
calling exit).
Second, in pr88941.c, an async parallel region without wait:
...
int
main (void)
{
#pragma acc parallel async
;
/* no #pragma acc wait */
return 0;
}
...
Because nvptx_exec is handling an async region, it does not call map_pop for
the element added by map_push, but schedules an kernel execution completion
event to call map_pop. Again, we run into the assert during
GOMP_OFFLOAD_fini_device, which is triggered from atexit handler
gomp_target_fini, but the exit in this case is triggered by returning from main.
So either the kernel is still running, or the kernel has completed but the
corresponding event that is supposed to call map_pop is stuck in the event
queue, waiting for an event_gc.
Fix this by removing the assert, and skipping the freeing of device memory if
the map is still marked active (though in the async case, this is more a
workaround than an fix).
2019-01-23 Tom de Vries <tdevries@suse.de>
PR target/88941
PR target/88939
* plugin/plugin-nvptx.c (cuda_map_destroy): Handle map->active case.
(map_fini): Remove "assert (!s->map->active)".
* testsuite/libgomp.oacc-c-c++-common/pr88941.c: New test.
From-SVN: r268177
The map field of a struct ptx_stream is a FIFO. The FIFO is implemented as a
single linked list, with pop-from-the-front semantics.
The function map_pop pops an element, either by:
- deallocating the element, if there is more than one element
- or marking the element inactive, if there's only one element
The responsibility of map_push is to push an element to the back, as well as
selecting the element to push, by:
- allocating an element, or
- reusing the element at the front if inactive and big enough, or
- dropping the element at the front if inactive and not big enough, and
allocating one that's big enough
The current implemention gets at least the first and most basic scenario wrong:
> map = cuda_map_create (size);
We create an element, and assign it to map.
> for (t = s->map; t->next != NULL; t = t->next)
> ;
We determine the last element in the fifo.
> t->next = map;
We append the new element.
> s->map = map;
But here, we throw away the rest of the FIFO, and declare the FIFO to be just
the new element.
This problem causes the test-case asyncwait-1.c to fail intermittently on some
systems. The pr87835.c test-case added here is a a minimized and modified
version of asyncwait-1.c (avoiding the kernel construct) that is more likely to
fail.
Fix this by rewriting map_pop more robustly, by:
- seperating the function in two phases: select element, push element
- when reusing or dropping an element, making sure that the element is cleanly
popped from the queue
- rewriting the push element part in such a way that it can handle all cases
without needing if statements, such that each line is exercised for each of
the three cases.
2019-01-23 Tom de Vries <tdevries@suse.de>
PR target/87835
* plugin/plugin-nvptx.c (map_push): Fix adding of allocated element.
* testsuite/libgomp.oacc-c-c++-common/pr87835.c: New test.
From-SVN: r268176
Update message in nvptx libgomp plugin about insufficient resources to launch
kernel, to accommodate for the fact the vector_length can now be variable.
2019-01-12 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (nvptx_exec): Update insufficient hardware
resources diagnostic.
From-SVN: r267890
When using a compiler build with:
...
+#define PTX_DEFAULT_VECTOR_LENGTH PTX_CTA_SIZE
...
consider a test-case:
...
int
main (void)
{
#pragma acc parallel vector_length (64)
#pragma acc loop worker
for (unsigned int i = 0; i < 32; i++)
#pragma acc loop vector
for (unsigned int j = 0; j < 64; j++)
;
return 0;
}
...
If num_workers is 16, either because:
- we add a "num_workers (16)" clause on the parallel directive, or
- we set "GOMP_OPENACC_DIM=:16:", or
- the libgomp plugin chooses 16 num_workers
we run into an illegal instruction at runtime, because a bar.sync instruction
tries to use a barrier 16. The instruction is illegal, because ptx supports
only 16 barriers per CTA, and the valid range is 0..15.
The problem is that with a warp-multiple vector length, we use a code generation
scheme with a per-worker barrier. And because barrier zero is reserved for
per-cta barrier, only the remaining 15 barriers can be used as per-worker
barrier, and consequently we can't use num_workers larger than 15.
This problem occurs only for vector_length 64. For vector_length 32, we use a
different code generation scheme, and for vector_length >= 96, the maximum
num_workers is not big enough not to trigger this problem.
Also, this problem only occurs for num_workers 16. As explained above,
num_workers 15 is safe to use, and 16 is already the maximum num_workers for
vector_length 64.
This patch fixes the problem in both the compiler (handling "num_workers (16)")
and in the libgomp nvptx plugin (with and without "GOMP_OPENACC_DIM=:16:").
2019-01-11 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.c (PTX_CTA_NUM_BARRIERS, PTX_PER_CTA_BARRIER)
(PTX_NUM_PER_CTA_BARRIER, PTX_FIRST_PER_WORKER_BARRIER)
(PTX_NUM_PER_WORKER_BARRIERS): Define.
(nvptx_apply_dim_limits): Prevent vector_length 64 and
num_workers 16.
* plugin/plugin-nvptx.c (nvptx_exec): Prevent vector_length 64 and
num_workers 16.
From-SVN: r267838
When using a compiler build with:
...
+#define PTX_DEFAULT_VECTOR_LENGTH PTX_CTA_SIZE
+#define PTX_MAX_VECTOR_LENGTH PTX_CTA_SIZE
...
and running the libgomp testsuite, we run into an execution failure in
parallel-loop-1.c, due to a cuda launch failure:
...
nvptx_exec: kernel f6_none_none$_omp_fn$0: launch gangs=480, workers=0, \
vectors=1024
libgomp: cuLaunchKernel error: invalid argument
...
because workers == 0.
The workers variable is set to 0 here in nvptx_exec:
...
workers = blocks / actual_vectors;
...
because actual_vectors is 1024, and blocks is 768:
...
cuOccupancyMaxPotentialBlockSize: grid = 10, block = 768
...
Fix this by ensuring that workers is at least one.
2019-01-09 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (nvptx_exec): Make sure to launch with at least
one worker.
From-SVN: r267746
An OpenACC async queue is always synchronized with itself, so invocations like
"#pragma acc wait(0) async(0)", or "acc_wait_async (0, 0)" don't make a lot of
sense, but are still valid.
libgomp/
PR libgomp/88495
* plugin/plugin-nvptx.c (nvptx_wait_async): Don't refuse
"identical parameters".
* testsuite/libgomp.oacc-c-c++-common/asyncwait-nop-1.c: Update.
* testsuite/libgomp.oacc-c-c++-common/lib-80.c: Remove.
From-SVN: r267152
Per my reading of the OpenACC specification (and as supported by secondary
documentation, such as code examples, or presentations), it's valid to call
"acc_get_cuda_stream"/"acc_set_cuda_stream" also with "acc_async_sync",
"acc_async_noval" arguments, not just with the nonnegative values as currently
implemented.
libgomp/
PR libgomp/88370
* libgomp.texi (acc_get_current_cuda_context, acc_get_cuda_stream)
(acc_set_cuda_stream): Clarify.
* oacc-cuda.c (acc_get_cuda_stream, acc_set_cuda_stream): Use
"async_valid_p".
* plugin/plugin-nvptx.c (nvptx_set_cuda_stream): Refuse "async ==
acc_async_sync".
* testsuite/libgomp.oacc-c-c++-common/acc_set_cuda_stream-1.c: New file.
* testsuite/libgomp.oacc-c-c++-common/async_queue-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/lib-84.c: Update.
* testsuite/libgomp.oacc-c-c++-common/lib-85.c: Likewise.
From-SVN: r267147
The CUDA driver API starting version 6.5 offers a set of runtime functions to
calculate several occupancy-related measures, as a replacement for the occupancy
calculator spreadsheet.
This patch adds a heuristic for default runtime launch geometry, based on the
new runtime function cuOccupancyMaxPotentialBlockSize.
Build on x86_64 with nvptx accelerator and ran libgomp testsuite.
2018-08-13 Cesar Philippidis <cesar@codesourcery.com>
Tom de Vries <tdevries@suse.de>
PR target/85590
* plugin/cuda/cuda.h (CUoccupancyB2DSize): New typedef.
(cuOccupancyMaxPotentialBlockSize): Declare.
* plugin/cuda-lib.def (cuOccupancyMaxPotentialBlockSize): New
CUDA_ONE_CALL_MAYBE_NULL.
* plugin/plugin-nvptx.c (CUDA_VERSION < 6050): Define
CUoccupancyB2DSize and declare
cuOccupancyMaxPotentialBlockSize.
(nvptx_exec): Use cuOccupancyMaxPotentialBlockSize to set the
default num_gangs and num_workers when the driver supports it.
Co-Authored-By: Tom de Vries <tdevries@suse.de>
From-SVN: r263505
Cuda driver api functions cuLinkAddData and cuLinkCreate are available starting
version 5.5. In version 6.5, they are remapped onto _v2 versions.
The dlopen interface of the libgomp nvptx plugin uses the _v2 versions, so it
won't work with a cuda driver with driver api version lower than 6.5.
This patch fixes the problem by testing for the presence of the _v2 versions,
and falling back to the original versions in case of absence of the _v2
versions.
Build on x86_64 with nvptx accelerator and reg-tested libgomp, both with and
without --without-cuda-driver.
2018-08-08 Tom de Vries <tdevries@suse.de>
* plugin/cuda-lib.def (cuLinkAddData_v2, cuLinkCreate_v2): Declare using
CUDA_ONE_CALL_MAYBE_NULL.
* plugin/plugin-nvptx.c (cuLinkAddData, cuLinkCreate): Undef and declare.
(cuLinkAddData_v2, cuLinkCreate_v2): Declare.
(link_ptx): Fall back to cuLinkAddData/cuLinkCreate if the _v2 versions
are not found.
From-SVN: r263408
Cuda driver api function cuGetErrorString is available in version 6.0 and
higher.
Currently, when the driver that is used does not contain this function, the
libgomp nvptx plugin will not build (PLUGIN_NVPTX_DYNAMIC == 0) or run
(PLUGIN_NVPTX_DYNAMIC == 1).
This patch fixes this problem by testing for the presence of the function, and
handling absence.
Build on x86_64 with nvptx accelerator and reg-tested libgomp, both with and
without --without-cuda-driver.
2018-08-08 Tom de Vries <tdevries@suse.de>
* plugin/cuda-lib.def (cuGetErrorString): Use CUDA_ONE_CALL_MAYBE_NULL.
* plugin/plugin-nvptx.c (cuda_error): Handle if cuGetErrorString is not
present.
From-SVN: r263407
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR is defined in cuda driver
api version 6.0 and higher.
Currently nvptx_open_device uses a hard-coded constant instead.
This patch fixes that by:
- defining CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR to the hardcoded
constant at toplevel, if not present in cuda.h, and
- using CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in nvptx_open_device
Build on x86_64 with nvptx accelerator and reg-tested libgomp.
2018-08-08 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c
(CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR): Define.
(nvptx_open_device): Use
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR.
From-SVN: r263406
Cuda driver api function cuGetErrorString is available in version 6.0 and
higher.
This patch:
- removes a comment saying the declaration is not available in cuda.h 6.0
- fixes the presence test to use CUDA_VERSION < 6000
- moves the declaration to toplevel
Build on x86_64 with nvptx accelerator and reg-tested libgomp.
2018-08-08 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (cuda_error): Move declaration of cuGetErrorString ...
(cuGetErrorString): ... here. Guard with CUDA_VERSION < 6000.
From-SVN: r263405
This patch adds handling of functions that may not be present in the cuda
driver.
Such a function can be declared using CUDA_ONE_CALL_MAYBE_NULL in cuda-lib.def,
it can be called with the usual convenience macros, but before calling its
presence needs to be tested using new macro CUDA_CALL_EXISTS.
When using the dlopen interface (PLUGIN_NVPTX_DYNAMIC == 1), we allow
non-present functions by allowing dlsym to return NULL. Otherwise
(PLUGIN_NVPTX_DYNAMIC == 0) we declare the non-present function to be weak.
Build and reg-tested libgomp on x86_64 with nvidia accelerator, with and without
--disable-cuda-driver, in combination with a trigger patch that adds a
non-existing function foo to cuda-lib.def:
...
CUDA_ONE_CALL_MAYBE_NULL (foo)
...
and declares it in plugin-nvptx.c:
...
CUresult foo (void);
...
and then uses it in nvptx_init after the init_cuda_lib call:
...
if (CUDA_CALL_EXISTS (foo))
CUDA_CALL (foo);
...
Also build and reg-tested on x86_64 with nvidia accelerator, with and without
--disable-cuda-driver, in combination with a trigger patch that replaces all
CUDA_ONE_CALLs in cuda-lib.def with CUDA_ONE_CALL_MAYBE_NULL, and guards two
CUDA_CALLs with CUDA_CALL_EXISTS, one for a regular fn, and one for a fn that is
a define in cuda/cuda.h.
2018-08-07 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (DO_PRAGMA): Define.
(struct cuda_lib_s): Add def/undef of CUDA_ONE_CALL_MAYBE_NULL.
(init_cuda_lib): Add new param to CUDA_ONE_CALL_1. Add arg to
corresponding call in CUDA_ONE_CALL. Add def/undef of
CUDA_ONE_CALL_MAYBE_NULL.
(CUDA_CALL_EXISTS): Define.
From-SVN: r263346
This patch makes sure that the lifetimes of the CUDA_ONE_CALL macro (which is
defined twice in plugin-nvptx.c) are minimized, to make it obvious that the
definitions are used only in the lib-cuda.def include.
Build on x86_64 with nvptx accelerator and reg-tested libgomp.
2018-08-07 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (struct cuda_lib_s, init_cuda_lib): Put
CUDA_ONE_CALL defines right before the cuda-lib.def include, and the
corresponding undefs right after.
From-SVN: r263345
Using libgomp configure option --with-cuda-driver=<dir> we can indicate what
cuda driver to use to build the libgomp nvptx plugin. Without such an option,
the system cuda driver is used, if available. If not availabe, a dlopen
interface is used instead.
However, when we use --without-cuda-driver (or the equivalent
--with-cuda-driver=no) the system cuda driver is still used if available.
This patch fixes that, making sure that --without-cuda-driver selects the dlopen
interface.
Build on x86_64 with nvptx accelerator and tested libgomp testsuite, with and
without option --without-cuda-driver.
2018-08-04 Tom de Vries <tdevries@suse.de>
* plugin/configfrag.ac: For --without-cuda-driver, set
CUDA_DRIVER_INCLUDE and CUDA_DRIVER_LIB to no. Handle
CUDA_DRIVER_INCLUDE == no and CUDA_DRIVER_LIB == no.
* configure: Regenerate.
From-SVN: r263310
2018-08-01 Tom de Vries <tdevries@suse.de>
* plugin/cuda-lib.def: New file. Factor out of ...
* plugin/plugin-nvptx.c (CUDA_CALLS): ... here.
(struct cuda_lib_s, init_cuda_lib): Include cuda-lib.def instead of
using CUDA_CALLS.
From-SVN: r263208
Currently parallel-loop-1.c fails at -O0 on a Quadro M1200, because one of the
kernel launch configurations exceeds the resources available in the device, due
to the default dimensions chosen by the runtime.
This patch fixes that by taking the per-function max_threads_per_block into
account when using the default dimensions.
2018-07-30 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (MIN, MAX): Redefine.
(nvptx_exec): Ensure worker and vector default dims don't exceed
targ_fn->max_threads_per_block.
From-SVN: r263062
The default dimensions are calculated using per-device properties, but
initialized once and used on all devices.
This patch fixes this problem by introducing per-device default dimensions.
2018-07-30 Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (struct ptx_device): Add default_dims field.
(nvptx_open_device): Init default_dims for device.
(nvptx_exec): Use default_dims from device.
From-SVN: r263061
Currently, when a kernel is lauched with too many workers, it results in a cuda
launch failure. This is triggered f.i. for parallel-loop-1.c at -O0 on a Quadro
M1200.
This patch detects this situation, and errors out with a hint on how to fix it.
Build and reg-tested on x86_64 with nvptx accelerator.
2018-07-26 Cesar Philippidis <cesar@codesourcery.com>
Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (nvptx_exec): Error if the hardware doesn't have
sufficient resources to launch a kernel, and give a hint on how to fix
it.
Co-Authored-By: Tom de Vries <tdevries@suse.de>
From-SVN: r262997
Move sampling of device properties from nvptx_exec to nvptx_open, and assume
the sampling always succeeds. This simplifies the default dimension
initialization code in nvptx_open.
2018-07-26 Cesar Philippidis <cesar@codesourcery.com>
Tom de Vries <tdevries@suse.de>
* plugin/plugin-nvptx.c (struct ptx_device): Add warp_size,
max_threads_per_block and max_threads_per_multiprocessor fields.
(nvptx_open_device): Initialize new fields.
(nvptx_exec): Use num_sms, and new fields.
Co-Authored-By: Tom de Vries <tdevries@suse.de>
From-SVN: r262996
2017-10-31 Tom de Vries <tom@codesourcery.com>
* plugin/plugin-hsa.c (HSA_LOG): Remove semicolon after
"do {} while (false)".
(init_single_kernel, GOMP_OFFLOAD_async_run): Add missing semicolon
after HSA_DEBUG call.
From-SVN: r254264
2017-06-27 Tom de Vries <tom@codesourcery.com>
* plugin/plugin-nvptx.c (notify_var): New function.
(nvptx_exec): Use notify_var for GOMP_OPENACC_DIM.
From-SVN: r249695
2017-06-27 Tom de Vries <tom@codesourcery.com>
* env.c (parse_unsigned_long_1): Factor out of ...
(parse_unsigned_long): ... here.
(parse_int_1): Factor out of ...
(parse_int): ... here.
(parse_int_secure): New function.
(initialize_env): Use parse_int_secure for GOMP_DEBUG.
* secure_getenv.h: Factor out of ...
* plugin/plugin-hsa.c: ... here.
* testsuite/libgomp.oacc-c-c++-common/gomp-debug-env.c: New test.
From-SVN: r249694
libgomp/
* libgomp-plugin.h (GOMP_OFFLOAD_openacc_parallel): Rename to
GOMP_OFFLOAD_openacc_exec. Adjust all users.
(GOMP_OFFLOAD_openacc_get_current_cuda_device): Rename to
GOMP_OFFLOAD_openacc_cuda_get_current_device. Adjust all users.
(GOMP_OFFLOAD_openacc_get_current_cuda_context): Rename to
GOMP_OFFLOAD_openacc_cuda_get_current_context. Adjust all users.
(GOMP_OFFLOAD_openacc_get_cuda_stream): Rename to
GOMP_OFFLOAD_openacc_cuda_get_stream. Adjust all users.
(GOMP_OFFLOAD_openacc_set_cuda_stream): Rename to
GOMP_OFFLOAD_openacc_cuda_set_stream. Adjust all users.
From-SVN: r245125
* plugin/configfrag.ac: For --without-cuda-driver don't initialize
CUDA_DRIVER_INCLUDE nor CUDA_DRIVER_LIB. If both
CUDA_DRIVER_INCLUDE and CUDA_DRIVER_LIB are empty and linking small
cuda program fails, define PLUGIN_NVPTX_DYNAMIC to 1 and use
plugin/include/cuda as include dir and -ldl instead of -lcuda as
library to link ptx plugin against.
* plugin/plugin-nvptx.c: Include dlfcn.h if PLUGIN_NVPTX_DYNAMIC.
(CUDA_CALLS): Define.
(cuda_lib, cuda_lib_inited): New variables.
(init_cuda_lib): New function.
(CUDA_CALL_PREFIX): Define.
(CUDA_CALL_ERET, CUDA_CALL_ASSERT): Use CUDA_CALL_PREFIX.
(CUDA_CALL): Use FN instead of (FN).
(CUDA_CALL_NOCHECK): Define.
(cuda_error, fini_streams_for_device, select_stream_for_async,
nvptx_attach_host_thread_to_device, nvptx_open_device, link_ptx,
event_gc, nvptx_exec, nvptx_async_test, nvptx_async_test_all,
nvptx_wait_all, nvptx_set_clocktick, GOMP_OFFLOAD_unload_image,
nvptx_stacks_alloc, nvptx_stacks_free, GOMP_OFFLOAD_run): Use
CUDA_CALL_NOCHECK.
(nvptx_init): Call init_cuda_lib, if it fails, return false. Use
CUDA_CALL_NOCHECK.
(nvptx_get_num_devices): Call init_cuda_lib, if it fails, return 0.
Use CUDA_CALL_NOCHECK.
* plugin/cuda/cuda.h: New file.
* config.h.in: Regenerated.
* configure: Regenerated.
From-SVN: r244522