<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe. It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
flag. This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any one point. This inherently
limits the performance of multi-threaded applications on multiprocessors.
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off. Longer wait times are implemented
by yielding and/or sleeping. If a collection is in progress, the pure
spinning stage is skipped. This has the advantage that uncontested, and
thus most uniprocessor, lock acquisitions are very cheap. It has the
disadvantage that the application may sleep for small periods of time
even when there is work to be done, and threads may be unnecessarily
woken up for short periods. Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most
cases, sometimes drastically so.
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors. As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other
otherwise supported Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors. The mark phase typically consumes the large majority of the
collection time. Thus this largely parallelizes the garbage collector
itself, though not the allocation process. Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently.
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for thread
local allocation. It does not, by itself, cause thread local allocation
to be used. It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators. It may be used by other
threads. The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that. This greatly reduces
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, and thus the number of short-duration lock acquisitions
is greatly reduced.
</ul>
<P>
The easiest way to switch an application to thread-local allocation is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
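A client source file set up along those lines might look like the following sketch. It requires building and linking against the collector, so it is a fragment rather than a standalone program; the allocation sizes are arbitrary.

```c
/* Sketch of a client source file following the three steps above.
 * Requires linking against the collector; not standalone. */
#define GC_REDIRECT_TO_LOCAL        /* step 1: define before including gc.h */
#include "gc.h"

int main(void)
{
    GC_thr_init();                  /* step 2: before any allocation */
    char *p = GC_MALLOC(64);        /* step 3: allocate via GC_MALLOC */
    char *q = GC_MALLOC_ATOMIC(64); /* pointer-free object */
    return p == 0 || q == 0;
}
```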
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability. Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal
is done with no synchronization. Thus it is possible for more than
one worker to remove the same entry, resulting in some work duplication.
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one. This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing. It does require
synchronization, but should be relatively rare.
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection. They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable). If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
the collector to immediately yield the processor instead of busy waiting
first. In the case of a multiprocessor and a client with multiple
simultaneously runnable threads, this may have disastrous performance
consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they share
a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds. With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively. The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
the client using either one or two marker threads, and running
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker. We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector,
the processors usually share a single memory bus, and thus
the aggregate memory bandwidth does not increase in
proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues. Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.

</body>
</html>