1d77bc548a
2013-07-04 Veres Lajos <vlajos@gmail.com> Jonathan Wakely <jwakely.gcc@gmail.com> * config/locale/generic/codecvt_members.cc: Fix typo. * configure.host: Likewise. * doc/html/manual/policy_data_structures_design.html: Likewise. * doc/xml/manual/policy_data_structures.xml: Likewise. * include/bits/hashtable.h: Likewise. * include/bits/random.h: Likewise. * include/profile/impl/profiler_trace.h: Likewise. * testsuite/23_containers/deque/cons/2.cc: Likewise. * testsuite/23_containers/deque/debug/shrink_to_fit.cc: Likewise. * testsuite/ext/pb_ds/example/basic_multimap.cc: Likewise. * testsuite/performance/23_containers/insert_erase/41975.cc: Likewise. Co-Authored-By: Jonathan Wakely <jwakely.gcc@gmail.com> From-SVN: r200681
5111 lines
181 KiB
XML
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
|
||
xml:id="manual.ext.containers.pbds" xreflabel="pbds">
|
||
<info>
|
||
<title>Policy-Based Data Structures</title>
|
||
<keywordset>
|
||
<keyword>ISO C++</keyword>
|
||
<keyword>policy</keyword>
|
||
<keyword>container</keyword>
|
||
<keyword>data</keyword>
|
||
<keyword>structure</keyword>
|
||
<keyword>associated</keyword>
|
||
<keyword>tree</keyword>
|
||
<keyword>trie</keyword>
|
||
<keyword>hash</keyword>
|
||
<keyword>metaprogramming</keyword>
|
||
</keywordset>
|
||
</info>
|
||
<?dbhtml filename="policy_data_structures.html"?>
|
||
|
||
<!-- 2006-04-01 Ami Tavory -->
|
||
<!-- 2011-05-25 Benjamin Kosnik -->
|
||
|
||
<!-- S01: intro -->
|
||
<section xml:id="pbds.intro">
|
||
<info><title>Intro</title></info>
|
||
|
||
<para>
|
||
This is a library of policy-based elementary data structures:
|
||
associative containers and priority queues. It is designed for
high performance, flexibility, semantic safety, and conformance to
|
||
the corresponding containers in <literal>std</literal> and
|
||
<literal>std::tr1</literal> (except for some points where it differs
|
||
by design).
|
||
</para>
|
||
<para>
|
||
</para>
|
||
|
||
<section xml:id="pbds.intro.issues">
|
||
<info><title>Performance Issues</title></info>
|
||
<para>
|
||
</para>
|
||
|
||
<para>
|
||
An attempt is made to categorize the wide variety of possible
|
||
container designs in terms of performance-impacting factors. These
|
||
performance factors are translated into design policies and
|
||
incorporated into container design.
|
||
</para>
|
||
|
||
<para>
|
||
There is a tension in unravelling these factors into a coherent set of
policies. Every attempt is made to keep the set of factors
minimal. However, in many cases multiple factors make for long
template names. Every attempt is made to use aliases and typedefs in
the source files, but the generated names for external symbols can
be large in binary files or debuggers.
|
||
</para>
|
||
|
||
<para>
|
||
In many cases, the longer names allow capabilities and behaviours
controlled by macros to also be unambiguously emitted as distinct
generated names.
|
||
</para>
|
||
|
||
<para>
|
||
Specific issues found while unraveling performance factors in the
|
||
design of associative containers and priority queues follow.
|
||
</para>
|
||
|
||
<section xml:id="pbds.intro.issues.associative">
|
||
<info><title>Associative</title></info>
|
||
|
||
<para>
|
||
Associative containers depend on their composite policies to a very
|
||
large extent. Implicitly hard-wiring policies can hamper their
|
||
performance and limit their functionality. An efficient hash-based
|
||
container, for example, requires policies for testing key
|
||
equivalence, hashing keys, translating hash values into positions
|
||
within the hash table, and determining when and how to resize the
|
||
table internally. A tree-based container can efficiently support
|
||
order statistics, i.e. the ability to query what is the order of
|
||
each key within the sequence of keys in the container, but only if
|
||
the container is supplied with a policy to internally update
|
||
meta-data. There are many other such examples.
|
||
</para>
|
||
|
||
<para>
|
||
Ideally, all associative containers would share the same
|
||
interface. Unfortunately, underlying data structures and mapping
|
||
semantics differentiate between different containers. For example,
|
||
suppose one writes a generic function manipulating an associative
|
||
container.
|
||
</para>
|
||
|
||
<programlisting>
|
||
template<typename Cntnr>
|
||
void
|
||
some_op_sequence(Cntnr& r_cnt)
|
||
{
|
||
...
|
||
}
|
||
</programlisting>
|
||
|
||
<para>
|
||
Given this, then what can one assume about the instantiating
|
||
container? The answer varies according to its underlying data
|
||
structure. If the underlying data structure of
|
||
<literal>Cntnr</literal> is based on a tree or trie, then the order
|
||
of elements is well defined; otherwise, it is not, in general. If
|
||
the underlying data structure of <literal>Cntnr</literal> is based
|
||
on a collision-chaining hash table, then modifying
<varname>r_cnt</varname> will not invalidate its iterators' order;
|
||
if the underlying data structure is a probing hash table, then this
|
||
is not the case. If the underlying data structure is based on a tree
|
||
or trie, then a reference to the container can efficiently be split;
|
||
otherwise, it cannot, in general. If the underlying data structure
|
||
is a red-black tree, then splitting a reference to the container is
|
||
exception-free; if it is an ordered-vector tree, exceptions can be
|
||
thrown.
|
||
</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.intro.issues.priority_queue">
|
||
<info><title>Priority Queues</title></info>
|
||
|
||
<para>
|
||
Priority queues are useful when one needs to efficiently access a
|
||
minimum (or maximum) value as the set of values changes.
|
||
</para>
|
||
|
||
<para>
|
||
Most useful data structures for priority queues have a relatively
|
||
simple structure, as they are geared toward relatively simple
|
||
requirements. Unfortunately, these structures do not support access
to an arbitrary value, which turns out to be necessary in many
algorithms, for example when decreasing an arbitrary value in a graph
algorithm. Some extra mechanism is therefore needed for accessing
arbitrary values. There are at least two
alternatives: embedding an associative container in a priority
queue, or allowing cross-referencing through iterators. The first
solution adds significant overhead; the second solution requires a
precise definition of iterator invalidation, which is the next
point.
|
||
</para>
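<para>
The second alternative, cross-referencing through iterators, can be
sketched as follows. This is a minimal illustration, assuming a GCC
toolchain that ships the <literal>__gnu_pbds</literal> headers; the
function name <function>top_after_decrease</function> and the stored
values are made up for the example.
</para>

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>

// Min-first queue of ints backed by a pairing heap
// (std::greater puts the smallest value on top).
typedef __gnu_pbds::priority_queue<int, std::greater<int>,
                                   __gnu_pbds::pairing_heap_tag> pq_type;

// Pushes three values, then decreases one of them through the point
// iterator that push() returned. Returns the new top of the queue.
inline int
top_after_decrease()
{
  pq_type q;
  q.push(10);
  pq_type::point_iterator it = q.push(30);  // handle to the stored value
  q.push(20);
  q.modify(it, 5);  // "decrease-key": no search for the value is needed
  return q.top();
}
```

<para>
The handle returned by <function>push</function> is what a graph
algorithm would store alongside each vertex, so that a later
decrease does not require locating the value again.
</para>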
|
||
|
||
<para>
|
||
Priority queues, like hash-based containers, store values in an
|
||
order that is meaningless and undefined externally. For example, a
|
||
<code>push</code> operation can internally reorganize the
|
||
values. Because of this characteristic, describing a priority
queue's iterators is difficult: on one hand, the values to which
|
||
iterators point can remain valid, but on the other, the logical
|
||
order of iterators can change unpredictably.
|
||
</para>
|
||
|
||
<para>
|
||
Roughly speaking, any element that is both inserted into a priority
queue (e.g., through <code>push</code>) and removed
from it (e.g., through <code>pop</code>) incurs a
|
||
logarithmic overhead (in the amortized sense). Different underlying
|
||
data structures place the actual cost differently: some are
|
||
optimized for amortized complexity, whereas others guarantee that
|
||
specific operations only have a constant cost. One underlying data
|
||
structure might be chosen if modifying a value is frequent
|
||
(Dijkstra's shortest-path algorithm), whereas a different one might
|
||
be chosen otherwise. Unfortunately, an array-based binary heap, an
underlying data structure that optimizes (in the amortized sense)
<code>push</code> and <code>pop</code> operations, differs from the
others in terms of its invalidation guarantees. Other design
decisions also impact the cost and placement of the overhead, at the
expense of more differences in the kinds of operations that the
|
||
underlying data structure can support. These differences pose a
|
||
challenge when creating a uniform interface for priority queues.
|
||
</para>
|
||
</section>
|
||
</section>
|
||
|
||
<section xml:id="pbds.intro.motivation">
|
||
<info><title>Goals</title></info>
|
||
|
||
<para>
|
||
Many fine associative-container libraries have already been written,
most notably the C++ standard's associative containers. Why
|
||
then write another library? This section shows some possible
|
||
advantages of this library, when considering the challenges in
|
||
the introduction. Many of these points stem from the fact that
|
||
the ISO C++ process introduced associative-containers in a
|
||
two-step process (first standardizing tree-based containers,
|
||
only then adding hash-based containers, which are fundamentally
|
||
different), did not standardize priority queues as containers,
|
||
and (in our opinion) overloads the iterator concept.
|
||
</para>
|
||
|
||
<section xml:id="pbds.intro.motivation.associative">
|
||
<info><title>Associative</title></info>
|
||
<para>
|
||
</para>
|
||
|
||
<section xml:id="motivation.associative.policy">
|
||
<info><title>Policy Choices</title></info>
|
||
<para>
|
||
Associative containers require a relatively large number of
|
||
policies to function efficiently in various settings. In some
|
||
cases this is needed for making their common operations more
|
||
efficient, and in other cases this allows them to support a
|
||
larger set of operations.
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
Hash-based containers, for example, support look-up and
|
||
insertion methods (<function>find</function> and
|
||
<function>insert</function>). In order to locate elements
quickly, they are supplied a hash functor, which specifies
how to transform a key object into some size type; a hash
functor might transform <constant>"hello"</constant>
into <constant>1123002298</constant>. A hash table, though,
requires transforming each key object into a size-type value
in some specific domain; a hash table with a 128-long
table might transform <constant>"hello"</constant> into
position <constant>63</constant>. The policy by which the
hash value is transformed into a position within the table
can dramatically affect performance. Hash-based containers
also do not resize naturally (as opposed to tree-based
containers, for example). The appropriate resize policy is
unfortunately intertwined with the policy that transforms
a hash value into a position within the table.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Tree-based containers, for example, also support look-up and
|
||
insertion methods, and are primarily useful when maintaining
|
||
order between elements is important. In some cases, though,
|
||
one can utilize their balancing algorithms for completely
|
||
different purposes.
|
||
</para>
|
||
|
||
<para>
|
||
Figure A shows a tree in which each node contains two entries:
|
||
a floating-point key, and some size-type
|
||
<emphasis>metadata</emphasis> (in bold beneath it) that is
|
||
the number of nodes in the sub-tree. (The root has key 0.99,
|
||
and has 5 nodes (including itself) in its sub-tree.) A
|
||
container based on this data structure can obviously answer
|
||
efficiently whether 0.3 is in the container object, but it
|
||
can also answer what is the order of 0.3 among all those in
|
||
the container object: see <xref linkend="biblio.clrs2001"/>.
|
||
|
||
</para>
|
||
|
||
<para>
|
||
As another example, Figure B shows a tree in which each node
|
||
contains two entries: a half-open geometric line interval,
|
||
and a number <emphasis>metadata</emphasis> (in bold beneath
|
||
it) that is the largest endpoint of all intervals in its
|
||
sub-tree. (The root describes the interval <constant>[20,
|
||
36)</constant>, and the largest endpoint in its sub-tree is
|
||
99.) A container based on this data structure can obviously
|
||
answer efficiently whether <constant>[3, 41)</constant> is
|
||
in the container object, but it can also answer efficiently
|
||
whether the container object has intervals that intersect
|
||
<constant>[3, 41)</constant>. These types of queries are
|
||
very useful in geometric algorithms and lease-management
|
||
algorithms.
|
||
</para>
|
||
|
||
<para>
|
||
It is important to note, however, that as the trees are
|
||
modified, their internal structure changes. To maintain
|
||
these invariants, one must supply some policy that is aware
|
||
of these changes. Without this, it would be better to use a
|
||
linked list (in itself very efficient for these purposes).
|
||
</para>
|
||
|
||
</listitem>
|
||
</orderedlist>
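<para>
The two-stage transformation described for hash-based containers
(key to hash value, then hash value to table position) can be
sketched with the standard library alone. The function names below
are hypothetical and are not this library's policy classes; they
only illustrate that the second stage is a separate, replaceable
decision.
</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical range-hashing policy: works for any table size.
inline std::size_t
mod_position(std::size_t hash, std::size_t table_size)
{ return hash % table_size; }

// Hypothetical range-hashing policy: cheaper, but only correct when
// table_size is a power of two.
inline std::size_t
mask_position(std::size_t hash, std::size_t table_size)
{ return hash & (table_size - 1); }

// Stage 1: key -> hash value. Stage 2: hash value -> table position.
inline std::size_t
position_of(const std::string& key, std::size_t table_size)
{ return mod_position(std::hash<std::string>()(key), table_size); }
```

<para>
For a power-of-two table size the two policies agree, which is one
reason the choice of position policy is tied to the resize policy:
the mask variant is only usable if resizing keeps the size a power
of two.
</para>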
|
||
|
||
<figure>
|
||
<title>Node Invariants</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_node_invariants.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Node Invariants</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
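<para>
The order-statistics query described for Figure A can be tried out
with a node-update policy. The sketch below assumes a GCC toolchain
that ships the <literal>__gnu_pbds</literal> headers; the key values
other than 0.99 and 0.3 are invented for the example and are not
taken from the figure.
</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree set whose nodes also carry sub-tree sizes,
// maintained by the order-statistics node-update policy.
typedef __gnu_pbds::tree<double, __gnu_pbds::null_type, std::less<double>,
                         __gnu_pbds::rb_tree_tag,
                         __gnu_pbds::tree_order_statistics_node_update>
  os_set;

// Returns how many stored keys are strictly smaller than key.
inline std::size_t
rank_of(double key)
{
  os_set s;
  const double keys[] = {0.99, 0.3, 0.1, 0.5, 1.7};  // illustrative values
  for (double k : keys)
    s.insert(k);
  return s.order_of_key(key);
}
```

<para>
Without the node-update policy the container would still store and
find the keys, but the rank query would not be available.
</para>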
|
||
|
||
</section>
|
||
|
||
<section xml:id="motivation.associative.underlying">
|
||
<info><title>Underlying Data Structures</title></info>
|
||
<para>
|
||
The standard C++ library contains associative containers based on
|
||
red-black trees and collision-chaining hash tables. These are
|
||
very useful, but they are not ideal for all types of
|
||
settings.
|
||
</para>
|
||
|
||
<para>
|
||
The figure below shows the different underlying data structures
|
||
currently supported in this library.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Underlying Associative Data Structures</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_different_underlying_dss_1.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Underlying Associative Data Structures</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
A shows a collision-chaining hash-table, B shows a probing
|
||
hash-table, C shows a red-black tree, D shows a splay tree, E shows
|
||
a tree based on an ordered vector (implicit in the order of the
|
||
elements), F shows a PATRICIA trie, and G shows a list-based
|
||
container with update policies.
|
||
</para>
|
||
|
||
<para>
|
||
Each of these data structures has some performance benefits, in
|
||
terms of speed, size or both. For now, note that vector-based trees
|
||
and probing hash tables manipulate memory more efficiently than
|
||
red-black trees and collision-chaining hash tables, and that
|
||
list-based associative containers are very useful for constructing
|
||
"multimaps".
|
||
</para>
|
||
|
||
<para>
|
||
Now consider a function manipulating a generic associative
|
||
container,
|
||
</para>
|
||
<programlisting>
|
||
template<class Cntnr>
|
||
int
|
||
some_op_sequence(Cntnr &r_cnt)
|
||
{
|
||
...
|
||
}
|
||
</programlisting>
|
||
|
||
<para>
|
||
Ideally, the underlying data structure
|
||
of <classname>Cntnr</classname> would not affect what can be
|
||
done with <varname>r_cnt</varname>. Unfortunately, this is not
|
||
the case.
|
||
</para>
|
||
|
||
<para>
|
||
For example, if <classname>Cntnr</classname>
|
||
is <classname>std::map</classname>, then the function can
|
||
use
|
||
</para>
|
||
<programlisting>
|
||
std::for_each(r_cnt.find(foo), r_cnt.find(bar), foobar)
|
||
</programlisting>
|
||
<para>
|
||
in order to apply <classname>foobar</classname> to all
|
||
elements between <classname>foo</classname> and
|
||
<classname>bar</classname>. If
|
||
<classname>Cntnr</classname> is a hash-based container,
|
||
then this call's results are undefined.
|
||
</para>
|
||
|
||
<para>
|
||
Also, if <classname>Cntnr</classname> is tree-based, the type
|
||
and object of the comparison functor can be
|
||
accessed. If <classname>Cntnr</classname> is hash-based, these
|
||
queries are nonsensical.
|
||
</para>
|
||
|
||
<para>
|
||
There are various other differences based on the container's
|
||
underlying data structure. For one, they can be constructed by,
|
||
and queried for, different policies. Furthermore:
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
Containers based on C, D, E and F store elements in a
|
||
meaningful order; the others store elements in a meaningless
|
||
(and probably time-varying) order. By implication, only
|
||
containers based on C, D, E and F can
|
||
support <function>erase</function> operations taking an
|
||
iterator and returning an iterator to the following element
|
||
without performance loss.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Containers based on C, D, E, and F can be split and joined
|
||
efficiently, while the others cannot. Containers based on C
|
||
and D, furthermore, can guarantee that this is exception-free;
|
||
containers based on E cannot guarantee this.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Containers based on all but E can guarantee that
|
||
erasing an element is exception free; containers based on E
|
||
cannot guarantee this. Containers based on all but B and E
|
||
can guarantee that modifying an object of their type does
|
||
not invalidate iterators or references to their elements,
|
||
while containers based on B and E cannot. Containers based
|
||
on C, D, and E can furthermore make a stronger guarantee,
|
||
namely that modifying an object of their type does not
|
||
affect the order of iterators.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
|
||
<para>
|
||
A unified tag and traits system (as used for the C++ standard
|
||
library iterators, for example) can ease generic manipulation of
|
||
associative containers based on different underlying data
|
||
structures.
|
||
</para>
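<para>
As an illustration of the tag and traits idea, the sketch below
dispatches on a container category at compile time. All names in it
(<classname>ordered_tag</classname>,
<classname>category_of</classname>,
<function>has_meaningful_order</function>) are hypothetical and use
standard containers only; they are not this library's actual tags
or traits.
</para>

```cpp
#include <cassert>
#include <map>
#include <unordered_map>

// Hypothetical category tags.
struct ordered_tag {};
struct unordered_tag {};

// Hypothetical traits class mapping a container type to its tag.
template<typename Cntnr>
struct category_of;

template<typename K, typename V>
struct category_of<std::map<K, V> >
{ typedef ordered_tag type; };

template<typename K, typename V>
struct category_of<std::unordered_map<K, V> >
{ typedef unordered_tag type; };

// Tag-specific implementations, selected by overload resolution.
inline bool order_is_meaningful(ordered_tag) { return true; }
inline bool order_is_meaningful(unordered_tag) { return false; }

// Generic entry point: asks the traits class, then dispatches.
template<typename Cntnr>
bool
has_meaningful_order(const Cntnr&)
{ return order_is_meaningful(typename category_of<Cntnr>::type()); }
```

<para>
A generic function such as <function>some_op_sequence</function>
could use the same mechanism to choose between a range-based and a
point-based implementation.
</para>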
|
||
|
||
</section>
|
||
|
||
<section xml:id="motivation.associative.iterators">
|
||
<info><title>Iterators</title></info>
|
||
<para>
|
||
Iterators are central to the design of the standard library
|
||
containers, because of the container/algorithm/iterator
|
||
decomposition that allows an algorithm to operate on a range
|
||
through iterators of some sequence. Iterators, then, are useful
|
||
because they allow going over a
|
||
specific <emphasis>sequence</emphasis>. The standard library
|
||
also uses iterators for accessing a
|
||
specific <emphasis>element</emphasis>: when an associative
|
||
container returns one through <function>find</function>. The
|
||
standard library consistently uses the same types of iterators
|
||
for both purposes: going over a range, and accessing a specific
|
||
found element. Before the introduction of hash-based containers
|
||
to the standard library, this made sense (with the exception of
|
||
priority queues, which are discussed later).
|
||
</para>
|
||
|
||
<para>
|
||
When the standard associative containers are used together with
non-order-preserving associative containers (and with
priority queues), there is a possible need for
different types of iterators for self-organizing containers:
the iterator concept seems overloaded to mean two different
things (in some cases). <remark>XXX See
Design::Associative Containers::Data-Structure Genericity::Point-Type
and Range-Type Methods (ds_gen.html#find_range)</remark>.
|
||
</para>
|
||
|
||
<section xml:id="associative.iterators.using">
|
||
<info>
|
||
<title>Using Point Iterators for Range Operations</title>
|
||
</info>
|
||
<para>
|
||
Suppose <classname>cntnr</classname> is some associative
|
||
container, and say <varname>c</varname> is an object of
|
||
type <classname>cntnr</classname>. Then what will be the outcome
|
||
of
|
||
</para>
|
||
|
||
<programlisting>
|
||
std::for_each(c.find(1), c.find(5), foo);
|
||
</programlisting>
|
||
|
||
<para>
|
||
If <classname>cntnr</classname> is a tree-based container
|
||
object, then an in-order walk will
|
||
apply <classname>foo</classname> to the relevant elements,
|
||
as in the graphic below, label A. If <varname>c</varname> is
|
||
a hash-based container, then the order of elements between any
|
||
two elements is undefined (and probably time-varying); there is
|
||
no guarantee that the elements traversed will coincide with the
|
||
<emphasis>logical</emphasis> elements between 1 and 5, as in
|
||
label B.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Range Iteration in Different Data Structures</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_point_iterators_range_ops_1.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Range Iteration in Different Data Structures</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
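<para>
The tree-based case (label A) can be reproduced with
<classname>std::map</classname>. The function name and the stored
keys below are chosen only for the example.
</para>

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Collects the keys visited by for_each over [find(1), find(5)).
inline std::vector<int>
keys_between_1_and_5()
{
  std::map<int, char> c;
  for (int k = 0; k <= 6; ++k)
    c[k] = 'x';

  std::vector<int> visited;
  std::for_each(c.find(1), c.find(5),
                [&visited](const std::pair<const int, char>& e)
                { visited.push_back(e.first); });
  return visited;  // in-order walk: keys 1, 2, 3, 4
}
```

<para>
The same call on a <classname>std::unordered_map</classname> is
deliberately not shown: if the element with key 5 happens to precede
the element with key 1 in the table's internal order, the range is
not valid and the traversal runs past the end.
</para>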
|
||
|
||
<para>
|
||
In our opinion, this problem is not caused just because
|
||
red-black trees are order preserving while
|
||
collision-chaining hash tables are (generally) not - it
|
||
is more fundamental. Most of the standard's containers
|
||
order sequences in a well-defined manner that is
|
||
determined by their <emphasis>interface</emphasis>:
|
||
calling <function>insert</function> on a tree-based
|
||
container modifies its sequence in a predictable way, as
|
||
does calling <function>push_back</function> on a list or
|
||
a vector. Conversely, collision-chaining hash tables,
|
||
probing hash tables, priority queues, and list-based
|
||
containers (which are very useful for "multimaps") are
|
||
self-organizing data structures; the effect of each
|
||
operation modifies their sequences in a manner that is
|
||
(practically) determined by their
|
||
<emphasis>implementation</emphasis>.
|
||
</para>
|
||
|
||
<para>
|
||
Consequently, applying an algorithm to a sequence obtained from most
|
||
containers may or may not make sense, but applying it to a
|
||
sub-sequence of a self-organizing container does not.
|
||
</para>
|
||
</section>
|
||
|
||
<section xml:id="associative.iterators.cost">
|
||
<info>
|
||
<title>Cost to Point Iterators to Enable Range Operations</title>
|
||
</info>
|
||
<para>
|
||
Suppose <varname>c</varname> is some collision-chaining
|
||
hash-based container object, and one calls
|
||
</para>
|
||
<programlisting>c.find(3)</programlisting>
|
||
<para>
|
||
Then what composes the returned iterator?
|
||
</para>
|
||
|
||
<para>
|
||
In the graphic below, label A shows the simplest (and
|
||
most efficient) implementation of a collision-chaining
|
||
hash table. The little box marked
|
||
<classname>point_iterator</classname> shows an object
|
||
that contains a pointer to the element's node. Note that
this "iterator" has no way to move to the next element
(it cannot support
<function>operator++</function>). Conversely, the little
box marked <classname>iterator</classname> stores both a
pointer to the element, as well as some other
information (the bucket number of the element). The
second iterator, then, is "heavier" than the first one:
it requires more time and space. If we were to use a
different container to cross-reference into this
hash-table using these iterators, it would take much
more space. As noted above, nothing much can be done by
|
||
incrementing these iterators, so why is this extra
|
||
information needed?
|
||
</para>
|
||
|
||
<para>
|
||
Alternatively, one might create a collision-chaining hash-table
|
||
where the lists might be linked, forming a monolithic total-element
|
||
list, as in the graphic below, label B. Here the iterators are as
|
||
light as can be, but the hash-table's operations are more
|
||
complicated.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Point Iteration in Hash Data Structures</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_point_iterators_range_ops_2.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Point Iteration in Hash Data Structures</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
It should be noted that containers based on collision-chaining
|
||
hash-tables are not the only ones with this type of behavior;
|
||
many other self-organizing data structures display it as well.
|
||
</para>
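<para>
The difference between the two kinds of iterator can be made
concrete with a toy collision-chaining layout. The types below
(<classname>node</classname>, <classname>point_handle</classname>,
<classname>range_handle</classname>) are hypothetical
simplifications, not this library's iterator classes.
</para>

```cpp
#include <cassert>
#include <cstddef>

// One element in a bucket's singly linked chain.
struct node
{
  int value;
  node* next;
};

// Point-type handle: enough to reach one element, nothing more.
struct point_handle
{
  node* p;
};

// Range-type handle: also remembers the bucket, so it can continue
// into the following buckets when the current chain ends.
struct range_handle
{
  node* p;
  std::size_t bucket;
};

// Moves a range_handle to the next element, or to {nullptr,
// num_buckets} when there is none. This is the work operator++
// would have to do.
inline void
next_element(range_handle& it, node* const* buckets,
             std::size_t num_buckets)
{
  it.p = it.p->next;
  while (it.p == nullptr && ++it.bucket < num_buckets)
    it.p = buckets[it.bucket];
}
```

<para>
The extra member is what makes the range-type handle larger, and
the bucket scan is what makes incrementing it more expensive than
following a single pointer.
</para>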
|
||
</section>
|
||
|
||
<section xml:id="associative.iterators.invalidation">
|
||
<info><title>Invalidation Guarantees</title></info>
|
||
<para>Consider the following snippet:</para>
|
||
<programlisting>
|
||
it = c.find(3);
|
||
c.erase(5);
|
||
</programlisting>
|
||
|
||
<para>
|
||
Following the call to <classname>erase</classname>, what is the
validity of <classname>it</classname>? Can it be de-referenced?
Can it be incremented?
|
||
</para>
|
||
|
||
<para>
|
||
The answer depends on the underlying data structure of the
|
||
container. The graphic below shows three cases: A1 and A2 show
|
||
a red-black tree; B1 and B2 show a probing hash-table; C1 and C2
|
||
show a collision-chaining hash table.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Effect of erase in different underlying data structures</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_invalidation_guarantee_erase.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Effect of erase in different underlying data structures</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
Erasing 5 from A1 yields A2. Clearly, an iterator to 3 can
|
||
be de-referenced and incremented. The sequence of iterators
|
||
changed, but in a way that is well-defined by the interface.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Erasing 5 from B1 yields B2. Clearly, an iterator to 3 is
|
||
not valid at all - it cannot be de-referenced or
|
||
incremented; the order of iterators changed in a way that is
|
||
(practically) determined by the implementation and not by
|
||
the interface.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Erasing 5 from C1 yields C2. Here the situation is more
|
||
complicated. On the one hand, there is no problem in
|
||
de-referencing <classname>it</classname>. On the other hand,
|
||
the order of iterators changed in a way that is
|
||
(practically) determined by the implementation and not by
|
||
the interface.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
|
||
<para>
|
||
So in the standard library containers, it is not always possible
|
||
to express whether <varname>it</varname> is valid or not. This
|
||
is true also for <function>insert</function>. Again, the
|
||
iterator concept seems overloaded.
|
||
</para>
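<para>
The red-black tree case (A1 to A2) corresponds to the behaviour of
<classname>std::map</classname>, which can be checked directly. The
keys below are chosen for the example and need not match the
graphic.
</para>

```cpp
#include <cassert>
#include <map>

// Finds 3, erases 5, then uses the iterator to 3 again.
// Returns the key that follows 3 after the erase.
inline int
next_key_after_erase()
{
  std::map<int, char> c;
  c[1] = 'a';
  c[3] = 'b';
  c[5] = 'c';
  c[7] = 'd';

  std::map<int, char>::iterator it = c.find(3);
  c.erase(5);

  assert(it->first == 3);  // still de-referenceable
  ++it;                    // still incrementable
  return it->first;        // the sequence is now 1, 3, 7
}
```

<para>
For a probing hash table no comparable check can be written, since
the iterator may no longer refer to a live element at all.
</para>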
|
||
</section>
|
||
</section> <!--iterators-->
|
||
|
||
|
||
<section xml:id="motivation.associative.functions">
|
||
<info><title>Functional</title></info>
|
||
<para>
|
||
</para>
|
||
|
||
<para>
|
||
The design of the functional overlay to the underlying data
structures differs slightly from some of the conventions used in
the C++ standard. A strict public interface of methods should
comprise only operations which depend on the class's internal
structure; other operations are best designed as external
functions (see <xref linkend="biblio.meyers02both"/>). By this
rubric, the standard associative containers lack some useful
methods, and provide other methods which would be better
removed.
|
||
</para>
|
||
|
||
<section xml:id="motivation.associative.functions.erase">
|
||
<info><title><function>erase</function></title></info>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
Order-preserving standard associative containers provide the
|
||
method
|
||
</para>
|
||
<programlisting>
|
||
iterator
|
||
erase(iterator it)
|
||
</programlisting>
|
||
|
||
<para>
|
||
which takes an iterator, erases the corresponding
element, and returns an iterator to the following
element. The standard hash-based associative
containers also provide this method. This seemingly
increases genericity between associative containers,
since it is possible to use
|
||
</para>
|
||
<programlisting>
|
||
typename C::iterator it = c.begin();
|
||
typename C::iterator e_it = c.end();
|
||
|
||
while(it != e_it)
|
||
it = pred(*it)? c.erase(it) : ++it;
|
||
</programlisting>
|
||
|
||
<para>
|
||
in order to erase from a container object <varname>c</varname>
all elements which match a
predicate <classname>pred</classname>. However, in a
|
||
different sense this actually decreases genericity: an
|
||
integral implication of this method is that tree-based
|
||
associative containers' memory use is linear in the total
|
||
number of elements they store, while hash-based
|
||
containers' memory use is unbounded in the total number of
|
||
elements they store. Assume a hash-based container is
|
||
allowed to decrease its size when an element is
|
||
erased. Then the elements might be rehashed, which means
|
||
that there is no "next" element - it is simply
|
||
undefined. Consequently, it is possible to infer from the
|
||
fact that the standard library's hash-based containers
|
||
provide this method that they cannot downsize when
|
||
elements are erased. As a consequence, different code is
|
||
needed to manipulate different containers, assuming that
|
||
memory should be conserved. Therefore, this library's
non-order-preserving associative containers omit this
method.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
All associative containers include a conditional-erase method
|
||
</para>
|
||
<programlisting>
|
||
template<class Pred>
size_type
erase_if(Pred pred)
|
||
</programlisting>
|
||
<para>
|
||
which erases all elements matching a predicate. This is probably the
|
||
only way to ensure linear-time multiple-item erase which can
|
||
actually downsize a container.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
The standard associative containers provide methods for
|
||
multiple-item erase of the form
|
||
</para>
|
||
<programlisting>
|
||
size_type
|
||
erase(It b, It e)
|
||
</programlisting>
|
||
<para>
|
||
erasing a range of elements given by a pair of
|
||
iterators. For tree-based or trie-based containers, this can
be implemented more efficiently as a (small) sequence of split
|
||
and join operations. For other, unordered, containers, this
|
||
method isn't much better than an external loop. Moreover,
|
||
if <varname>c</varname> is a hash-based container,
|
||
then
|
||
</para>
|
||
<programlisting>
|
||
c.erase(c.find(2), c.find(5))
|
||
</programlisting>
|
||
<para>
|
||
is almost certain to do something
|
||
different than erasing all elements whose keys are between 2
|
||
and 5, and is likely to produce other undefined behavior.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
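<para>
For standard containers, which lack a conditional-erase member, the
iterator-returning <function>erase</function> loop from the first
item can be wrapped in a free function. The name
<function>erase_matching</function> is hypothetical. Unlike a
member such as <function>erase_if</function>, this external loop
relies on <function>erase</function> returning a usable "next"
iterator, so it inherits the limitation discussed above.
</para>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <unordered_set>

// Erases every element for which pred returns true and reports how
// many were erased. Requires that c.erase(it) returns the iterator
// following the erased element.
template<typename Cntnr, typename Pred>
std::size_t
erase_matching(Cntnr& c, Pred pred)
{
  std::size_t erased = 0;
  typename Cntnr::iterator it = c.begin();
  while (it != c.end())
    {
      if (pred(*it))
        {
          it = c.erase(it);
          ++erased;
        }
      else
        ++it;
    }
  return erased;
}
```

<para>
The return value plays the role of the <type>size_type</type>
result of a conditional-erase method: the number of elements
removed.
</para>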
|
||
</section> <!-- erase -->
|
||
|
||
<section xml:id="motivation.associative.functions.split">
|
||
<info>
|
||
<title>
|
||
<function>split</function> and <function>join</function>
|
||
</title>
|
||
</info>
|
||
<para>
|
||
It is well-known that tree-based and trie-based container
objects can be efficiently split or joined (see
<xref linkend="biblio.clrs2001"/>). Externally splitting or
joining trees is super-linear, and, furthermore, can throw
exceptions. Split and join methods, consequently, seem good
choices for tree-based container methods, especially since, as
noted above, they are efficient replacements for erasing
sub-sequences.
|
||
</para>
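<para>
Erasing a sub-sequence with split and join might look like the
sketch below. It assumes a GCC toolchain that ships the
<literal>__gnu_pbds</literal> headers, and assumes that
<function>split</function> keeps the keys not greater than its
argument in the object it is called on; the helper name
<function>erase_range</function> is hypothetical.
</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree set of ints with the default node-update policy.
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::rb_tree_tag> int_set;

// Removes every key k with lo < k <= hi using two splits and one
// join instead of a per-element loop. Returns the number removed.
inline std::size_t
erase_range(int_set& s, int lo, int hi)
{
  int_set mid, upper;
  s.split(lo, mid);       // s: keys <= lo, mid: keys > lo
  mid.split(hi, upper);   // mid: lo < keys <= hi, upper: keys > hi
  const std::size_t erased = mid.size();
  s.join(upper);          // key ranges of s and upper do not overlap
  return erased;          // mid is destroyed on return
}
```

<para>
The discarded middle tree is destroyed as a whole, so no element is
erased individually from the original container.
</para>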
|
||
|
||
</section> <!-- split -->
|
||
|
||
<section xml:id="motivation.associative.functions.insert">
|
||
<info>
|
||
<title>
|
||
<function>insert</function>
|
||
</title>
|
||
</info>
|
||
<para>
|
||
The standard associative containers provide methods of the form
|
||
</para>
|
||
<programlisting>
|
||
template<class It>
|
||
size_type
|
||
insert(It b, It e);
|
||
</programlisting>
|
||
|
||
<para>
|
||
for inserting a range of elements given by a pair of
|
||
iterators. At best, this can be implemented as an external loop,
|
||
or, even more efficiently, as a join operation (for the case of
|
||
tree-based or trie-based containers). Moreover, these methods seem
|
||
similar to constructors taking a range given by a pair of
|
||
iterators; the constructors, however, are transactional, whereas
|
||
the insert methods are not; this is possibly confusing.
|
||
</para>
|
||
|
||
</section> <!-- insert -->
|
||
|
||
<section xml:id="motivation.associative.functions.compare">
|
||
<info>
|
||
<title>
|
||
<function>operator==</function> and <function>operator<=</function>
|
||
</title>
|
||
</info>
|
||
|
||
<para>
|
||
	  Associative containers are parametrized by policies that allow
	  testing key equivalence: a hash-based container can do this through
|
||
its equivalence functor, and a tree-based container can do this
|
||
through its comparison functor. In addition, some standard
|
||
associative containers have global function operators, like
|
||
<function>operator==</function> and <function>operator<=</function>,
|
||
that allow comparing entire associative containers.
|
||
</para>
|
||
|
||
<para>
|
||
In our opinion, these functions are better left out. To begin
|
||
with, they do not significantly improve over an external
|
||
loop. More importantly, however, they are possibly misleading -
|
||
<function>operator==</function>, for example, usually checks for
|
||
equivalence, or interchangeability, but the associative
|
||
container cannot check for values' equivalence, only keys'
|
||
	  equivalence; also, are two containers considered equivalent if
	  they store the same values in different order? This is an
	  arbitrary decision.
|
||
</para>
|
||
</section> <!-- compare -->
|
||
|
||
</section> <!-- functional -->
|
||
|
||
</section> <!--associative-->
|
||
|
||
<section xml:id="pbds.intro.motivation.priority_queue">
|
||
<info><title>Priority Queues</title></info>
|
||
|
||
<section xml:id="motivation.priority_queue.policy">
|
||
<info><title>Policy Choices</title></info>
|
||
|
||
<para>
|
||
Priority queues are containers that allow efficiently inserting
|
||
values and accessing the maximal value (in the sense of the
|
||
container's comparison functor). Their interface
|
||
supports <function>push</function>
|
||
	  and <function>pop</function>. The standard
	  container <classname>std::priority_queue</classname> indeed supports
	  these methods, but little else. For algorithmic and
|
||
software-engineering purposes, other methods are needed:
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
Many graph algorithms (see
|
||
<xref linkend="biblio.clrs2001"/>) require increasing a
|
||
value in a priority queue (again, in the sense of the
|
||
container's comparison functor), or joining two
|
||
priority-queue objects.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>The return type of <classname>priority_queue</classname>'s
|
||
<function>push</function> method is a point-type iterator, which can
|
||
be used for modifying or erasing arbitrary values. For
|
||
example:</para>
|
||
<programlisting>
|
||
priority_queue<int> p;
|
||
priority_queue<int>::point_iterator it = p.push(3);
|
||
p.modify(it, 4);
|
||
</programlisting>
|
||
|
||
<para>These types of cross-referencing operations are necessary
|
||
for making priority queues useful for different applications,
|
||
especially graph applications.</para>
|
||
|
||
</listitem>
|
||
<listitem>
|
||
<para>
|
||
It is sometimes necessary to erase an arbitrary value in a
|
||
priority queue. For example, consider
|
||
the <function>select</function> function for monitoring
|
||
file descriptors:
|
||
</para>
|
||
|
||
<programlisting>
|
||
int
|
||
select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds,
|
||
struct timeval *timeout);
|
||
</programlisting>
|
||
<para>
|
||
then, as the select documentation states:
|
||
</para>
|
||
<para>
|
||
<quote>
|
||
The nfds argument specifies the range of file
|
||
descriptors to be tested. The select() function tests file
|
||
descriptors in the range of 0 to nfds-1.</quote>
|
||
</para>
|
||
|
||
<para>
|
||
It stands to reason, therefore, that we might wish to
|
||
maintain a minimal value for <varname>nfds</varname>, and
|
||
priority queues immediately come to mind. Note, though, that
|
||
	      when a socket is closed, the minimal value for <varname>nfds</varname> might
|
||
change; in the absence of an efficient means to erase an
|
||
arbitrary value from a priority queue, we might as well
|
||
avoid its use altogether.
|
||
</para>
|
||
|
||
<para>
|
||
The standard containers typically support iterators. It is
|
||
somewhat unusual
|
||
for <classname>std::priority_queue</classname> to omit them
|
||
	      (see <xref linkend="biblio.meyers01stl"/>). One might
	      ask why priority queues need to support iterators, since
|
||
they are self-organizing containers with a different purpose
|
||
than abstracting sequences. There are several reasons:
|
||
</para>
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
Iterators (even in self-organizing containers) are
|
||
useful for many purposes: cross-referencing
|
||
containers, serialization, and debugging code that uses
|
||
these containers.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
The standard library's hash-based containers support
|
||
iterators, even though they too are self-organizing
|
||
containers with a different purpose than abstracting
|
||
sequences.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
In standard-library-like containers, it is natural to specify the
|
||
interface of operations for modifying a value or erasing
|
||
		  a value (discussed previously) in terms of iterators.
|
||
It should be noted that the standard
|
||
containers also use iterators for accessing and
|
||
manipulating a specific value. In hash-based
|
||
containers, one checks the existence of a key by
|
||
comparing the iterator returned by <function>find</function> to the
|
||
iterator returned by <function>end</function>, and not by comparing a
|
||
pointer returned by <function>find</function> to <type>NULL</type>.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
</listitem>
|
||
</orderedlist>
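	<para>
	  The <function>select</function> scenario above can be sketched
	  as follows. The class <classname>fd_tracker</classname> is a
	  hypothetical helper, not part of the library: it stores the
	  point-type iterator returned by <function>push</function> for
	  each descriptor, so that an arbitrary descriptor can later be
	  erased from the priority queue. The sketch assumes the default
	  comparison functor (so that <function>top</function> is the
	  largest descriptor) and that each descriptor is added at most
	  once.
	</para>

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <ext/pb_ds/priority_queue.hpp>

// Hypothetical helper (not part of the library): tracks open descriptors so
// that the smallest valid nfds argument is always available.
class fd_tracker
{
  typedef __gnu_pbds::priority_queue<int> queue_type;
  typedef std::map<int, queue_type::point_iterator> handle_map;

  queue_type m_fds;      // default std::less<int>: top() is the largest fd
  handle_map m_handles;  // fd -> point-type iterator returned by push()

public:
  // Assumes fd is not already tracked.
  void add(int fd)
  { m_handles.insert(std::make_pair(fd, m_fds.push(fd))); }

  void remove(int fd)
  {
    handle_map::iterator pos = m_handles.find(fd);
    if (pos == m_handles.end())
      return;
    m_fds.erase(pos->second);  // erase an arbitrary value, not just the top
    m_handles.erase(pos);
  }

  // Smallest nfds covering all tracked descriptors, or 0 if there are none.
  int nfds() const
  { return m_fds.empty() ? 0 : m_fds.top() + 1; }
};
```

	<para>
	  With <classname>std::priority_queue</classname> the
	  <function>remove</function> step has no direct equivalent,
	  which is the limitation the text describes.
	</para>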
|
||
|
||
</section>
|
||
|
||
<section xml:id="motivation.priority_queue.underlying">
|
||
<info><title>Underlying Data Structures</title></info>
|
||
|
||
<para>
|
||
There are three main implementations of priority queues: the
|
||
first employs a binary heap, typically one which uses a
|
||
sequence; the second uses a tree (or forest of trees), which is
|
||
typically less structured than an associative container's tree;
|
||
the third simply uses an associative container. These are
|
||
shown in the figure below with labels A1 and A2, B, and C.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Underlying Priority Queue Data Structures</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_different_underlying_dss_2.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Underlying Priority Queue Data Structures</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
No single implementation can completely replace any of the
|
||
others. Some have better <function>push</function>
|
||
and <function>pop</function> amortized performance, some have
|
||
better bounded (worst case) response time than others, some
|
||
optimize a single method at the expense of others, etc. In
|
||
general the "best" implementation is dictated by the specific
|
||
problem.
|
||
</para>
|
||
|
||
<para>
|
||
As with associative containers, the more implementations
|
||
co-exist, the more necessary a traits mechanism is for handling
|
||
generic containers safely and efficiently. This is especially
|
||
	  important for priority queues, since the invalidation guarantees
	  of one of the most useful data structures - binary heaps - are
	  markedly different from those of most of the others.
|
||
</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="motivation.priority_queue.binary_heap">
|
||
<info><title>Binary Heaps</title></info>
|
||
|
||
|
||
<para>
|
||
Binary heaps are one of the most useful underlying
|
||
data structures for priority queues. They are very efficient in
|
||
terms of memory (since they don't require per-value structure
|
||
metadata), and have the best amortized <function>push</function> and
|
||
<function>pop</function> performance for primitive types like
|
||
<type>int</type>.
|
||
</para>
|
||
|
||
<para>
|
||
The standard library's <classname>priority_queue</classname>
|
||
implements this data structure as an adapter over a sequence,
|
||
typically
|
||
<classname>std::vector</classname>
|
||
or <classname>std::deque</classname>, which correspond to labels
|
||
A1 and A2 respectively in the graphic above.
|
||
</para>
|
||
|
||
<para>
|
||
This is indeed an elegant example of the adapter concept and
|
||
the algorithm/container/iterator decomposition. (See <xref linkend="biblio.nelson96stlpq"/>). There are
|
||
several reasons why a binary-heap priority queue
|
||
may be better implemented as a container instead of a
|
||
sequence adapter:
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>std::priority_queue</classname> cannot erase values
|
||
from its adapted sequence (irrespective of the sequence
|
||
type). This means that the memory use of
|
||
an <classname>std::priority_queue</classname> object is always
|
||
proportional to the maximal number of values it ever contained,
|
||
and not to the number of values that it currently
|
||
contains. (See <filename>performance/priority_queue_text_pop_mem_usage.cc</filename>.)
|
||
This implementation of binary heaps acts very differently than
|
||
other underlying data structures (See also pairing heaps).
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Some combinations of adapted sequences and value types
|
||
are very inefficient or just don't make sense. If one uses
|
||
	      <classname>std::priority_queue<std::string,
	      std::vector<std::string> ></classname>, for example, then not only will each
	      operation perform a logarithmic number of
	      <classname>std::string</classname> assignments, but, furthermore, any
	      operation (including <function>pop</function>) can render the container
	      useless due to exceptions. Conversely, if one uses
	      <classname>std::priority_queue<int, std::deque<int>
	      ></classname>, then each operation incurs a logarithmic
	      number of indirect accesses (through pointers) unnecessarily.
|
||
It might be better to let the container make a conservative
|
||
deduction whether to use the structure in the graphic above, labels A1 or A2.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
There does not seem to be a systematic way to determine
|
||
what exactly can be done with the priority queue.
|
||
</para>
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
		  If <varname>p</varname> is a priority queue adapting an
		  <classname>std::vector</classname>, then it is possible to iterate over
		  all values by using <function>&p.top()</function> and
		  <function>&p.top() + p.size()</function>, but this will not work
		  if <varname>p</varname> is adapting an <classname>std::deque</classname>; in any
		  case, one cannot use <function>p.begin()</function> and
		  <function>p.end()</function>. If a different sequence is adapted, it
|
||
is even more difficult to determine what can be
|
||
done.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
If <varname>p</varname> is a priority queue adapting an
|
||
		  <classname>std::deque</classname>, then the reference returned by
|
||
</para>
|
||
<programlisting>
|
||
p.top()
|
||
</programlisting>
|
||
<para>
|
||
will remain valid until it is popped,
|
||
but if <varname>p</varname> adapts an <classname>std::vector</classname>, the
|
||
next <function>push</function> will invalidate it. If a different
|
||
sequence is adapted, it is even more difficult to
|
||
determine what can be done.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Sequence-based binary heaps can still implement
|
||
linear-time <function>erase</function> and <function>modify</function> operations.
|
||
This means that if one needs to erase a small
|
||
(say logarithmic) number of values, then one might still
|
||
choose this underlying data structure. Using
|
||
<classname>std::priority_queue</classname>, however, this will generally
|
||
change the order of growth of the entire sequence of
|
||
operations.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
|
||
</section>
|
||
</section>
|
||
</section> <!-- goals/motivation -->
|
||
</section> <!-- intro -->
|
||
|
||
<!-- S02: Using -->
|
||
<section xml:id="containers.pbds.using">
|
||
<info><title>Using</title></info>
|
||
<?dbhtml filename="policy_data_structures_using.html"?>
|
||
|
||
<section xml:id="pbds.using.prereq">
|
||
<info><title>Prerequisites</title></info>
|
||
|
||
	<para>The library contains only header files, and does not require any
	other libraries except the standard C++ library. All classes are
	defined in namespace <code>__gnu_pbds</code>. The library internally
	uses macros beginning with <code>PB_DS</code>, but
	<code>#undef</code>s anything it <code>#define</code>s (except for
	header guards). Compiling the library in an environment where macros
	beginning with <code>PB_DS</code> are defined may yield unpredictable
	results in compilation, execution, or both.</para>
|
||
|
||
<para>
|
||
Further dependencies are necessary to create the visual output
|
||
for the performance tests. To create these graphs, an
|
||
additional package is needed: <command>pychart</command>.
|
||
</para>
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.organization">
|
||
<info><title>Organization</title></info>
|
||
|
||
<para>
|
||
The various data structures are organized as follows.
|
||
</para>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Branch-Based
|
||
</para>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>basic_branch</classname>
|
||
		is an abstract base class for branch-based
|
||
associative-containers
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>tree</classname>
|
||
is a concrete base class for tree-based
|
||
associative-containers
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>trie</classname>
|
||
		is a concrete base class for trie-based
|
||
associative-containers
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Hash-Based
|
||
</para>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>basic_hash_table</classname>
|
||
is an abstract base class for hash-based
|
||
associative-containers
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>cc_hash_table</classname>
|
||
		is a concrete collision-chaining hash-based
		associative container
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>gp_hash_table</classname>
|
||
		is a concrete (general) probing hash-based
		associative container
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
List-Based
|
||
</para>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>list_update</classname>
|
||
		is a list-based update-policy associative container
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
</listitem>
|
||
<listitem>
|
||
<para>
|
||
Heap-Based
|
||
</para>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>priority_queue</classname>
|
||
		is a priority queue
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
<para>
|
||
The hierarchy is composed naturally so that commonality is
|
||
captured by base classes. Thus <function>operator[]</function>
|
||
is defined at the base of any hierarchy, since all derived
|
||
containers support it. Conversely <function>split</function> is
|
||
defined in <classname>basic_branch</classname>, since only
|
||
tree-like containers support it.
|
||
</para>
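	<para>
	  As a small illustration of this commonality, the sketch below
	  writes one generic function that relies only on
	  <function>operator[]</function> and applies it to a hash-based,
	  a tree-based, and a list-based map. The names
	  <function>tally</function> and
	  <function>count_of_sevens</function> and the typedefs are
	  hypothetical.
	</para>

```cpp
#include <cassert>
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::cc_hash_table<int, int> hash_counts;
typedef __gnu_pbds::tree<int, int>          tree_counts;
typedef __gnu_pbds::list_update<int, int>   list_counts;

// Counts occurrences of each key; relies only on operator[], which every
// map-like container in the hierarchy provides.
template<typename Map>
void tally(Map& counts, const int* first, const int* last)
{
  for (; first != last; ++first)
    ++counts[*first];
}

// Convenience wrapper used to exercise tally() with different containers.
template<typename Map>
int count_of_sevens()
{
  static const int data[] = { 7, 3, 7, 1, 7 };
  Map counts;
  tally(counts, data, data + sizeof(data) / sizeof(data[0]));
  return counts[7];
}
```

	<para>
	  A function that called <function>split</function> instead
	  would compile only for the branch-based containers.
	</para>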
|
||
|
||
<para>
|
||
In addition, there are the following diagnostics classes,
|
||
used to report errors specific to this library's data
|
||
structures.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Exception Hierarchy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PDF" scale="75"
|
||
fileref="../images/pbds_exception_hierarchy.pdf"/>
|
||
</imageobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_exception_hierarchy.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Exception Hierarchy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.tutorial">
|
||
<info><title>Tutorial</title></info>
|
||
|
||
<section xml:id="pbds.using.tutorial.basic">
|
||
<info><title>Basic Use</title></info>
|
||
|
||
<para>
|
||
	  For the most part, the policy-based containers in
|
||
namespace <literal>__gnu_pbds</literal> have the same interface as
|
||
the equivalent containers in the standard C++ library, except for
|
||
the names used for the container classes themselves. For example,
|
||
this shows basic operations on a collision-chaining hash-based
|
||
container:
|
||
</para>
|
||
<programlisting>
|
||
	  #include <cassert>
	  #include <ext/pb_ds/assoc_container.hpp>

	  int main()
	  {
	    __gnu_pbds::cc_hash_table<int, char> c;
	    c[2] = 'b';
	    assert(c.find(1) == c.end());
	  }
|
||
</programlisting>
|
||
|
||
<para>
|
||
The container is called
|
||
<classname>__gnu_pbds::cc_hash_table</classname> instead of
|
||
<classname>std::unordered_map</classname>, since <quote>unordered
|
||
map</quote> does not necessarily mean a hash-based map as implied by
|
||
the C++ library (C++11 or TR1). For example, list-based associative
|
||
containers, which are very useful for the construction of
|
||
"multimaps," are also unordered.
|
||
</para>
|
||
|
||
<para>This snippet shows a red-black tree based container:</para>
|
||
|
||
<programlisting>
|
||
	  #include <cassert>
	  #include <ext/pb_ds/assoc_container.hpp>

	  int main()
	  {
	    __gnu_pbds::tree<int, char> c;
	    c[2] = 'b';
	    assert(c.find(2) != c.end());
	  }
|
||
</programlisting>
|
||
|
||
<para>The container is called <classname>tree</classname> instead of
|
||
<classname>map</classname> since the underlying data structures are
|
||
being named with specificity.
|
||
</para>
|
||
|
||
<para>
|
||
The member function naming convention is to strive to be the same as
|
||
the equivalent member functions in other C++ standard library
|
||
containers. The familiar methods are unchanged:
|
||
<function>begin</function>, <function>end</function>,
|
||
<function>size</function>, <function>empty</function>, and
|
||
<function>clear</function>.
|
||
</para>
|
||
|
||
<para>
|
||
This isn't to say that things are exactly as one would expect, given
|
||
	  the container requirements and interfaces in the C++ standard.
|
||
</para>
|
||
|
||
<para>
|
||
The names of containers' policies and policy accessors are
|
||
	  different from the usual. For example, if <type>hash_type</type> is
|
||
some type of hash-based container, then</para>
|
||
|
||
<programlisting>
|
||
hash_type::hash_fn
|
||
</programlisting>
|
||
|
||
<para>
|
||
gives the type of its hash functor, and if <varname>obj</varname> is
|
||
some hash-based container object, then
|
||
</para>
|
||
|
||
<programlisting>
|
||
obj.get_hash_fn()
|
||
</programlisting>
|
||
|
||
<para>will return a reference to its hash-functor object.</para>
|
||
|
||
|
||
<para>
|
||
Similarly, if <type>tree_type</type> is some type of tree-based
|
||
container, then
|
||
</para>
|
||
|
||
<programlisting>
|
||
tree_type::cmp_fn
|
||
</programlisting>
|
||
|
||
<para>
|
||
gives the type of its comparison functor, and if
|
||
<varname>obj</varname> is some tree-based container object,
|
||
then
|
||
</para>
|
||
|
||
<programlisting>
|
||
obj.get_cmp_fn()
|
||
</programlisting>
|
||
|
||
<para>will return a reference to its comparison-functor object.</para>
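	<para>
	  The policy names and accessors above can be exercised as in the
	  following sketch. The typedefs and the helper functions
	  <function>hash_with</function> and
	  <function>ordered_before</function> are hypothetical; the tree
	  is given <classname>std::greater</classname> only to make the
	  effect of its comparison functor visible.
	</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::cc_hash_table<int, char> hash_type;
typedef __gnu_pbds::tree<int, char, std::greater<int> > tree_type;

// Hashes a key with the container's own functor, named via hash_fn.
inline std::size_t hash_with(const hash_type& obj, int key)
{
  const hash_type::hash_fn& h = obj.get_hash_fn();
  return h(key);
}

// Asks the container's own comparison functor whether a is ordered before b.
inline bool ordered_before(const tree_type& obj, int a, int b)
{
  const tree_type::cmp_fn& cmp = obj.get_cmp_fn();
  return cmp(a, b);
}
```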
|
||
|
||
<para>
|
||
It would be nice to give names consistent with those in the existing
|
||
C++ standard (inclusive of TR1). Unfortunately, these standard
|
||
containers don't consistently name types and methods. For example,
|
||
<classname>std::tr1::unordered_map</classname> uses
|
||
<type>hasher</type> for the hash functor, but
|
||
<classname>std::map</classname> uses <type>key_compare</type> for
|
||
	  the comparison functor. Also, the accessor for
	  <classname>std::tr1::unordered_map</classname>'s hash functor is
	  <function>hash_function</function>, but
	  <classname>std::map</classname> uses <function>key_comp</function>
	  for accessing the comparison functor.
|
||
</para>
|
||
|
||
<para>
|
||
Instead, <literal>__gnu_pbds</literal> attempts to be internally
|
||
consistent, and uses standard-derived terminology if possible.
|
||
</para>
|
||
|
||
<para>
|
||
Another source of difference is in scope:
|
||
<literal>__gnu_pbds</literal> contains more types of associative
|
||
containers than the standard C++ library, and more opportunities
|
||
to configure these new containers, since different types of
|
||
associative containers are useful in different settings.
|
||
</para>
|
||
|
||
<para>
|
||
Namespace <literal>__gnu_pbds</literal> contains different classes for
|
||
hash-based containers, tree-based containers, trie-based containers,
|
||
and list-based containers.
|
||
</para>
|
||
|
||
<para>
|
||
Since associative containers share parts of their interface, they
|
||
are organized as a class hierarchy.
|
||
</para>
|
||
|
||
<para>Each type or method is defined in the most-common ancestor
|
||
in which it makes sense.
|
||
</para>
|
||
|
||
<para>For example, all associative containers support iteration
|
||
expressed in the following form:
|
||
</para>
|
||
|
||
<programlisting>
|
||
const_iterator
|
||
begin() const;
|
||
|
||
iterator
|
||
begin();
|
||
|
||
const_iterator
|
||
end() const;
|
||
|
||
iterator
|
||
end();
|
||
</programlisting>
|
||
|
||
<para>
|
||
But not all containers contain or use hash functors. Yet, both
|
||
collision-chaining and (general) probing hash-based associative
|
||
containers have a hash functor, so
|
||
<classname>basic_hash_table</classname> contains the interface:
|
||
</para>
|
||
|
||
<programlisting>
|
||
const hash_fn&
|
||
get_hash_fn() const;
|
||
|
||
hash_fn&
|
||
get_hash_fn();
|
||
</programlisting>
|
||
|
||
<para>
|
||
so all hash-based associative containers inherit the same
|
||
hash-functor accessor methods.
|
||
</para>
|
||
|
||
</section> <!--basic use -->
|
||
|
||
<section xml:id="pbds.using.tutorial.configuring">
|
||
<info>
|
||
<title>
|
||
Configuring via Template Parameters
|
||
</title>
|
||
</info>
|
||
|
||
<para>
|
||
In general, each of this library's containers is
|
||
parametrized by more policies than those of the standard library. For
|
||
example, the standard hash-based container is parametrized as
|
||
follows:
|
||
</para>
|
||
<programlisting>
|
||
template<typename Key, typename Mapped, typename Hash,
|
||
	  typename Pred, typename Allocator, bool Cache_Hash_Code>
|
||
class unordered_map;
|
||
</programlisting>
|
||
|
||
<para>
|
||
and so can be configured by key type, mapped type, a functor
|
||
that translates keys to unsigned integral types, an equivalence
|
||
predicate, an allocator, and an indicator whether to store hash
|
||
	  values with each entry. This library's collision-chaining
|
||
hash-based container is parametrized as
|
||
</para>
|
||
<programlisting>
|
||
template<typename Key, typename Mapped, typename Hash_Fn,
|
||
typename Eq_Fn, typename Comb_Hash_Fn,
|
||
	  typename Resize_Policy, bool Store_Hash,
|
||
typename Allocator>
|
||
class cc_hash_table;
|
||
</programlisting>
|
||
|
||
<para>
|
||
and so can be configured by the first four types of
|
||
<classname>std::tr1::unordered_map</classname>, then a
|
||
policy for translating the key-hash result into a position
|
||
within the table, then a policy by which the table resizes,
|
||
an indicator whether to store hash values with each entry,
|
||
and an allocator (which is typically the last template
|
||
parameter in standard containers).
|
||
</para>
|
||
|
||
<para>
|
||
Nearly all policy parameters have default values, so this
|
||
need not be considered for casual use. It is important to
|
||
note, however, that hash-based containers' policies can
|
||
dramatically alter their performance in different settings,
|
||
and that tree-based containers' policies can make them
|
||
useful for other purposes than just look-up.
|
||
</para>
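	<para>
	  For illustration, the sketch below spells out every policy of
	  a collision-chaining hash table except the allocator. The hash
	  functor <classname>int_hash</classname>, the typedef
	  <classname>tuned_table</classname>, and the helper
	  <function>lookup_or</function> are hypothetical. The range
	  hashing and resize policies shown are assumed to match the
	  library defaults, so only the hash functor and the stored hash
	  values differ from casual use.
	</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Hypothetical hash functor, used only for illustration.
struct int_hash
{
  std::size_t operator()(int x) const
  { return static_cast<std::size_t>(x) * 2654435761u; }
};

typedef __gnu_pbds::cc_hash_table<
  int,                                        // Key
  char,                                       // Mapped
  int_hash,                                   // Hash_Fn
  std::equal_to<int>,                         // Eq_Fn
  __gnu_pbds::direct_mask_range_hashing<>,    // Comb_Hash_Fn
  __gnu_pbds::hash_standard_resize_policy<>,  // Resize_Policy
  true                                        // Store_Hash
> tuned_table;

// Returns the mapped value for key, or fallback if the key is absent.
inline char lookup_or(tuned_table& c, int key, char fallback)
{
  tuned_table::point_iterator it = c.find(key);
  return it == c.end() ? fallback : it->second;
}
```

	<para>
	  The interface is unchanged by these choices; only performance
	  characteristics are expected to differ.
	</para>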
|
||
|
||
|
||
<para>As opposed to associative containers, priority queues have
|
||
relatively few configuration options. The priority queue is
|
||
parametrized as follows:</para>
|
||
<programlisting>
|
||
	  template<typename Value_Type, typename Cmp_Fn, typename Tag,
|
||
typename Allocator>
|
||
class priority_queue;
|
||
</programlisting>
|
||
|
||
<para>The <classname>Value_Type</classname>, <classname>Cmp_Fn</classname>, and
|
||
<classname>Allocator</classname> parameters are the container's value type,
|
||
comparison-functor type, and allocator type, respectively;
|
||
these are very similar to the standard's priority queue. The
|
||
<classname>Tag</classname> parameter is different: there are a number of
|
||
pre-defined tag types corresponding to binary heaps, binomial
|
||
heaps, etc., and <classname>Tag</classname> should be instantiated
|
||
by one of them.</para>
|
||
|
||
<para>Note that as opposed to the
|
||
<classname>std::priority_queue</classname>,
|
||
<classname>__gnu_pbds::priority_queue</classname> is not a
|
||
sequence-adapter; it is a regular container.</para>
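	<para>
	  A brief sketch of selecting the underlying data structure
	  through the <classname>Tag</classname> parameter follows. The
	  function <function>drain</function> is hypothetical; it is
	  instantiated in the usage note with three of the library's tag
	  types, and the observable behavior is expected to be the same
	  for each.
	</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>
#include <ext/pb_ds/priority_queue.hpp>

// Pushes a fixed set of values into a priority queue backed by the data
// structure selected by Tag, then pops them all in priority order.
template<typename Tag>
std::vector<int> drain()
{
  typedef __gnu_pbds::priority_queue<int, std::less<int>, Tag> queue_type;
  static const int data[] = { 3, 1, 4, 1, 5 };

  queue_type q;
  for (std::size_t i = 0; i < sizeof(data) / sizeof(data[0]); ++i)
    q.push(data[i]);

  std::vector<int> out;
  while (!q.empty())
    {
      out.push_back(q.top());
      q.pop();
    }
  return out;
}
```

	<para>
	  For example, <code>drain&lt;__gnu_pbds::pairing_heap_tag&gt;()</code>,
	  <code>drain&lt;__gnu_pbds::binary_heap_tag&gt;()</code>, and
	  <code>drain&lt;__gnu_pbds::binomial_heap_tag&gt;()</code> should
	  all yield the values in descending order.
	</para>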
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.tutorial.traits">
|
||
<info>
|
||
<title>
|
||
Querying Container Attributes
|
||
</title>
|
||
</info>
|
||
<para></para>
|
||
|
||
	<para>A container's underlying data structure
	affects its performance; unfortunately, it can also affect
	its interface. When manipulating associative containers
	generically, it is often useful to be able to statically
	determine what they can support and what they cannot.
|
||
</para>
|
||
|
||
<para>Happily, the standard provides a good solution to a similar
|
||
problem - that of the different behavior of iterators. If
|
||
<classname>It</classname> is an iterator, then
|
||
</para>
|
||
<programlisting>
|
||
typename std::iterator_traits<It>::iterator_category
|
||
</programlisting>
|
||
|
||
<para>is one of a small number of pre-defined tag classes, and
|
||
</para>
|
||
<programlisting>
|
||
typename std::iterator_traits<It>::value_type
|
||
</programlisting>
|
||
|
||
<para>is the value type to which the iterator "points".</para>
|
||
|
||
<para>
|
||
Similarly, in this library, if <type>C</type> is a
|
||
container, then <classname>container_traits</classname> is a
|
||
trait class that stores information about the kind of
|
||
container that is implemented.
|
||
</para>
|
||
<programlisting>
|
||
typename container_traits<C>::container_category
|
||
</programlisting>
|
||
<para>
|
||
is one of a small number of predefined tag structures that
|
||
uniquely identifies the type of underlying data structure.
|
||
</para>
|
||
|
||
<para>In most cases, however, the exact underlying data
|
||
structure is not really important, but what is important is
|
||
one of its other attributes: whether it guarantees storing
|
||
elements by key order, for example. For this one can
|
||
use</para>
|
||
<programlisting>
|
||
	  container_traits<C>::order_preserving
|
||
</programlisting>
|
||
<para>
|
||
Also,
|
||
</para>
|
||
<programlisting>
|
||
typename container_traits<C>::invalidation_guarantee
|
||
</programlisting>
|
||
|
||
<para>is the container's invalidation guarantee. Invalidation
|
||
guarantees are especially important regarding priority queues,
|
||
since in this library's design, iterators are practically the
|
||
only way to manipulate them.</para>
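	<para>
	  The traits above might be queried as in the following sketch.
	  The helper names are hypothetical. The sketch assumes that
	  <code>order_preserving</code> is a compile-time constant
	  convertible to <type>bool</type>, and that
	  <classname>rb_tree_tag</classname> and
	  <classname>cc_hash_tag</classname> are the categories of the
	  default tree and of the collision-chaining hash table.
	</para>

```cpp
#include <cassert>
#include <string>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

typedef __gnu_pbds::tree<int, char>          tree_map;
typedef __gnu_pbds::cc_hash_table<int, char> hash_map;

// True if C guarantees storing its elements by key order.
template<typename C>
bool keeps_key_order()
{ return __gnu_pbds::container_traits<C>::order_preserving; }

// Overloads selected at compile time by the container_category tag.
inline std::string family(__gnu_pbds::rb_tree_tag)
{ return "red-black tree"; }

inline std::string family(__gnu_pbds::cc_hash_tag)
{ return "collision-chaining hash"; }

template<typename C>
std::string family_of()
{
  typedef typename __gnu_pbds::container_traits<C>::container_category tag;
  return family(tag());
}
```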
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.tutorial.point_range_iteration">
|
||
<info>
|
||
<title>
|
||
Point and Range Iteration
|
||
</title>
|
||
</info>
|
||
<para></para>
|
||
|
||
<para>This library differentiates between two types of methods
|
||
and iterators: point-type, and range-type. For example,
|
||
<function>find</function> and <function>insert</function> are point-type methods, since
|
||
they each deal with a specific element; their returned
|
||
iterators are point-type iterators. <function>begin</function> and
|
||
<function>end</function> are range-type methods, since they are not used to
|
||
find a specific element, but rather to go over all elements in
|
||
a container object; their returned iterators are range-type
|
||
iterators.
|
||
</para>
|
||
|
||
<para>Most containers store elements in an order that is
|
||
determined by their interface. Correspondingly, it is fine that
|
||
their point-type iterators are synonymous with their range-type
|
||
iterators. For example, in the following snippet
|
||
</para>
|
||
<programlisting>
|
||
std::for_each(c.find(1), c.find(5), foo);
|
||
</programlisting>
|
||
<para>
|
||
two point-type iterators (returned by <function>find</function>) are used
|
||
for a range-type purpose - going over all elements whose key is
|
||
between 1 and 5.
|
||
</para>
|
||
|
||
<para>
|
||
Conversely, the above snippet makes no sense for
|
||
self-organizing containers - ones that order (and reorder)
|
||
their elements by implementation. It would be nice to have a
|
||
uniform iterator system that would allow the above snippet to
|
||
compile only if it made sense.
|
||
</para>
|
||
|
||
<para>
|
||
This could trivially be done by specializing
|
||
<function>std::for_each</function> for the case of iterators returned by
|
||
<classname>std::tr1::unordered_map</classname>, but this would only solve the
|
||
problem for one algorithm and one container. Fundamentally, the
|
||
problem is that one can loop using a self-organizing
|
||
container's point-type iterators.
|
||
</para>
|
||
|
||
<para>
|
||
This library's containers define two families of
|
||
iterators: <type>point_const_iterator</type> and
|
||
<type>point_iterator</type> are the iterator types returned by
|
||
point-type methods; <type>const_iterator</type> and
|
||
<type>iterator</type> are the iterator types returned by range-type
|
||
methods.
|
||
</para>
|
||
<programlisting>
|
||
class <- some container ->
|
||
{
|
||
public:
|
||
...
|
||
|
||
typedef <- something -> const_iterator;
|
||
|
||
typedef <- something -> iterator;
|
||
|
||
typedef <- something -> point_const_iterator;
|
||
|
||
typedef <- something -> point_iterator;
|
||
|
||
...
|
||
|
||
public:
|
||
...
|
||
|
||
const_iterator begin () const;
|
||
|
||
iterator begin();
|
||
|
||
point_const_iterator find(...) const;
|
||
|
||
point_iterator find(...);
|
||
};
|
||
</programlisting>
|
||
|
||
<para>For
|
||
	containers whose interface defines sequence order, it
|
||
is very simple: point-type and range-type iterators are exactly
|
||
the same, which means that the above snippet will compile if it
|
||
is used for an order-preserving associative container.
|
||
</para>
|
||
|
||
<para>
|
||
For self-organizing containers, however (hash-based
containers being one example), the preceding snippet will
|
||
not compile, because their point-type iterators do not support
|
||
<function>operator++</function>.
|
||
</para>
|
||
|
||
<para>In any case, both for order-preserving and self-organizing
|
||
containers, the following snippet will compile:
|
||
</para>
|
||
<programlisting>
|
||
typename Cntnr::point_iterator it = c.find(2);
|
||
</programlisting>
|
||
|
||
<para>
|
||
because a range-type iterator can always be converted to a
|
||
point-type iterator.
|
||
</para>
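<para>
A minimal sketch of such point-type usage, assuming the GCC headers
under <filename>ext/pb_ds</filename> and the namespace
<literal>__gnu_pbds</literal>. The helper name
<function>contains_key</function> is hypothetical and not part of the
library. It compares the iterator returned by
<function>find</function> but never increments it, so it should
compile for both order-preserving and self-organizing containers.
</para>

```cpp
#include <ext/pb_ds/assoc_container.hpp>

// Hypothetical helper (not part of the library): reports whether a key is
// stored in a pb_ds associative container. Only a point-type iterator is
// used, and it is compared but never incremented.
template<typename Cntnr>
bool
contains_key(Cntnr& c, const typename Cntnr::key_type& key)
{
  typename Cntnr::point_iterator it = c.find(key);
  return it != c.end();
}
```

<para>
The same function template can be instantiated with, for instance, a
<classname>cc_hash_table</classname> or a <classname>tree</classname>.
</para>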
|
||
|
||
<para>Distinguishing between iterator types also
|
||
raises the point that a container's iterators might have
|
||
different invalidation rules concerning their de-referencing
|
||
abilities and movement abilities. This now corresponds exactly
|
||
to the question of whether point-type and range-type iterators
|
||
are valid. As explained above, <classname>container_traits</classname> allows
|
||
querying a container for its data structure attributes. The
|
||
iterator-invalidation guarantees are certainly a property of
|
||
the underlying data structure, and so
|
||
</para>
|
||
<programlisting>
|
||
container_traits<C>::invalidation_guarantee
|
||
</programlisting>
|
||
|
||
<para>
|
||
gives one of three pre-determined types that answer this
|
||
query.
|
||
</para>
|
||
|
||
</section>
|
||
</section> <!-- tutorial -->
|
||
|
||
<section xml:id="pbds.using.examples">
|
||
<info><title>Examples</title></info>
|
||
<para>
|
||
Additional code examples are provided in the source
|
||
distribution, as part of the regression and performance
|
||
testsuite.
|
||
</para>
|
||
|
||
<section xml:id="pbds.using.examples.basic">
|
||
<info><title>Intermediate Use</title></info>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Basic use of maps:
|
||
<filename>basic_map.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Basic use of sets:
|
||
<filename>basic_set.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Conditionally erasing values from an associative container object:
|
||
<filename>erase_if.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Basic use of multimaps:
|
||
<filename>basic_multimap.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Basic use of multisets:
|
||
<filename>basic_multiset.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Basic use of priority queues:
|
||
<filename>basic_priority_queue.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Splitting and joining priority queues:
|
||
<filename>priority_queue_split_join.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Conditionally erasing values from a priority queue:
|
||
<filename>priority_queue_erase_if.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.query">
|
||
<info><title>Querying with <classname>container_traits</classname> </title></info>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Using <classname>container_traits</classname> to query
|
||
about underlying data structure behavior:
|
||
<filename>assoc_container_traits.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
A non-compiling example showing wrong use of finding keys in
|
||
hash-based containers: <filename>hash_find_neg.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>
|
||
Using <classname>container_traits</classname>
|
||
to query about underlying data structure behavior:
|
||
<filename>priority_queue_container_traits.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.container">
|
||
<info><title>By Container Method</title></info>
|
||
<para></para>
|
||
|
||
<section xml:id="pbds.using.examples.container.hash">
|
||
<info><title>Hash-Based</title></info>
|
||
|
||
<section xml:id="pbds.using.examples.container.hash.resize">
|
||
<info><title>size Related</title></info>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Setting the initial size of a hash-based container
|
||
object:
|
||
<filename>hash_initial_size.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
A non-compiling example showing how not to resize a
|
||
hash-based container object:
|
||
<filename>hash_resize_neg.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Resizing a hash-based container object:
|
||
<filename>hash_resize.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Showing an illegal resize of a hash-based container
|
||
object:
|
||
<filename>hash_illegal_resize.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Changing the load factors of a hash-based container
|
||
object: <filename>hash_load_set_change.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.container.hash.hashor">
|
||
<info><title>Hashing Function Related</title></info>
|
||
<para></para>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Using a modulo range-hashing function for the case of an
|
||
unknown skewed key distribution:
|
||
<filename>hash_mod.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Writing a range-hashing functor for the case of a known
|
||
skewed key distribution:
|
||
<filename>shift_mask.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Storing the hash value along with each key:
|
||
<filename>store_hash.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Writing a ranged-hash functor:
|
||
<filename>ranged_hash.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.container.branch">
|
||
<info><title>Branch-Based</title></info>
|
||
|
||
|
||
<section xml:id="pbds.using.examples.container.branch.split">
|
||
<info><title>split or join Related</title></info>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Joining two tree-based container objects:
|
||
<filename>tree_join.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Splitting a PATRICIA trie container object:
|
||
<filename>trie_split.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Order statistics while joining two tree-based container
|
||
objects:
|
||
<filename>tree_order_statistics_join.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.container.branch.invariants">
|
||
<info><title>Node Invariants</title></info>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Using trees for order statistics:
|
||
<filename>tree_order_statistics.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Augmenting trees to support operations on line
|
||
intervals:
|
||
<filename>tree_intervals.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.container.branch.trie">
|
||
<info><title>trie</title></info>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Using a PATRICIA trie for DNA strings:
|
||
<filename>trie_dna.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Using a PATRICIA
|
||
trie for finding all entries whose key matches a given prefix:
|
||
<filename>trie_prefix_search.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.using.examples.container.priority_queue">
|
||
<info><title>Priority Queues</title></info>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Cross referencing an associative container and a priority
|
||
queue: <filename>priority_queue_xref.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Cross referencing a vector and a priority queue using a
|
||
very simple version of Dijkstra's shortest path
|
||
algorithm:
|
||
<filename>priority_queue_dijkstra.cc</filename>
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
</section>
|
||
|
||
|
||
</section>
|
||
|
||
</section>
|
||
|
||
</section> <!-- using -->
|
||
|
||
<!-- S03: Design -->
|
||
|
||
|
||
<section xml:id="containers.pbds.design">
|
||
<info><title>Design</title></info>
|
||
<?dbhtml filename="policy_data_structures_design.html"?>
|
||
<para></para>
|
||
|
||
<section xml:id="pbds.design.concepts">
|
||
<info><title>Concepts</title></info>
|
||
|
||
<section xml:id="pbds.design.concepts.null_type">
|
||
<info><title>Null Policy Classes</title></info>
|
||
|
||
<para>
|
||
Associative containers are typically parametrized by various
|
||
policies. For example, a hash-based associative container is
|
||
parametrized by a hash-functor, transforming each key into a
non-negative numerical type. Each such value is then further mapped
|
||
into a position within the table. The mapping of a key into a
|
||
position within the table is therefore a two-step process.
|
||
</para>
|
||
|
||
<para>
|
||
In some cases, instantiations are redundant. For example, when the
|
||
keys are integers, it is possible to use a redundant hash policy,
|
||
which transforms each key into its value.
|
||
</para>
|
||
|
||
<para>
|
||
In some other cases, these policies are irrelevant. For example, a
|
||
hash-based associative container might transform keys into positions
|
||
within a table by a different method than the two-step method
|
||
described above. In such a case, the hash functor is simply
|
||
irrelevant.
|
||
</para>
|
||
|
||
<para>
|
||
When a policy is either redundant or irrelevant, it can be replaced
|
||
by <classname>null_type</classname>.
|
||
</para>
|
||
|
||
<para>
|
||
For example, a <emphasis>set</emphasis> is an associative
|
||
container with one of its template parameters (the one for the
|
||
mapped type) replaced with <classname>null_type</classname>. Other
places where this technique makes simplifications possible
|
||
include node updates in tree and trie data structures, and hash
|
||
and probe functions for hash data structures.
|
||
</para>
|
||
</section>
|
||
|
||
<section xml:id="pbds.design.concepts.associative_semantics">
|
||
<info><title>Map and Set Semantics</title></info>
|
||
|
||
<section xml:id="concepts.associative_semantics.set_vs_map">
|
||
<info>
|
||
<title>
|
||
Distinguishing Between Maps and Sets
|
||
</title>
|
||
</info>
|
||
|
||
<para>
|
||
Anyone familiar with the standard knows that there are four kinds
|
||
of associative containers: maps, sets, multimaps, and
|
||
multisets. The map datatype associates each key to
|
||
some data.
|
||
</para>
|
||
|
||
<para>
|
||
Sets are associative containers that simply store keys -
|
||
they do not map them to anything. In the standard, each map class
|
||
has a corresponding set class. E.g.,
|
||
<classname>std::map<int, char></classname> maps each
|
||
<classname>int</classname> to a <classname>char</classname>, but
|
||
<classname>std::set<int></classname> simply stores
|
||
<classname>int</classname>s. In this library, however, there are no
|
||
distinct classes for maps and sets. Instead, an associative
|
||
container's <classname>Mapped</classname> template parameter is a policy: if
|
||
it is instantiated by <classname>null_type</classname>, then it
|
||
is a "set"; otherwise, it is a "map". E.g.,
|
||
</para>
|
||
<programlisting>
|
||
cc_hash_table<int, char>
|
||
</programlisting>
|
||
<para>
|
||
is a "map" mapping each <type>int</type> value to a <type>
|
||
char</type>, but
|
||
</para>
|
||
<programlisting>
|
||
cc_hash_table<int, null_type>
|
||
</programlisting>
|
||
<para>
|
||
is a type that uniquely stores <type>int</type> values.
|
||
</para>
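<para>
The two instantiations can be exercised as follows. This is a sketch
only, assuming the header
<filename>ext/pb_ds/assoc_container.hpp</filename> and the namespace
<literal>__gnu_pbds</literal>; the typedef and function names are
hypothetical and chosen for illustration.
</para>

```cpp
#include <cstddef>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>

// A "map": each int key is associated with a char.
typedef __gnu_pbds::cc_hash_table<int, char> int_char_map;

// A "set": the Mapped parameter is null_type, so only keys are stored.
typedef __gnu_pbds::cc_hash_table<int, __gnu_pbds::null_type> int_set;

// Hypothetical demo: inserts a duplicate key and returns the resulting size.
inline std::size_t
distinct_keys_after_duplicate_insert()
{
  int_set s;
  s.insert(7);
  s.insert(7); // The key is already present, so nothing is added.
  s.insert(9);
  return s.size();
}

// Hypothetical demo: stores two mappings and looks one of them up.
inline char
lookup_in_map()
{
  int_char_map m;
  m[1] = 'a';
  m.insert(std::make_pair(2, 'b'));
  return m.find(2)->second;
}
```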
|
||
<para>Once the <classname>Mapped</classname> template parameter is instantiated
|
||
by <classname>null_type</classname>, then
|
||
the "set" acts very similarly to the standard's sets - it does not
|
||
map each key to a distinct <classname>null_type</classname> object. Also,
the container's <type>value_type</type> is essentially
its <type>key_type</type> - just as with the standard's sets.</para>
|
||
|
||
<para>
|
||
The standard's multimaps and multisets allow, respectively,
|
||
non-uniquely mapping keys and non-uniquely storing keys. As
|
||
discussed, the
|
||
reasons why this might be necessary are 1) that a key might be
|
||
decomposed into a primary key and a secondary key, 2) that a
|
||
key might appear more than once, or 3) any arbitrary
|
||
combination of 1)s and 2)s. Correspondingly,
|
||
one should use 1) "maps" mapping primary keys to secondary
|
||
keys, 2) "maps" mapping keys to size types, or 3) any arbitrary
|
||
combination of 1)s and 2)s. Thus, for example, an
|
||
<classname>std::multiset<int></classname> might be used to store
|
||
multiple instances of integers, but using this library's
|
||
containers, one might use
|
||
</para>
|
||
<programlisting>
|
||
tree<int, size_t>
|
||
</programlisting>
|
||
|
||
<para>
|
||
i.e., a <classname>map</classname> of <type>int</type>s to
|
||
<type>size_t</type>s.
|
||
</para>
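<para>
One possible way to use such a "map" as a counting multiset is
sketched below. The helper names (<function>ms_insert</function>,
<function>ms_count</function>, <function>ms_erase_one</function>) are
hypothetical, and the sketch assumes that
<function>operator[]</function> value-initializes a missing mapped
value to zero.
</para>

```cpp
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

// Each distinct int is stored once, mapped to its number of occurrences.
typedef __gnu_pbds::tree<int, std::size_t> int_multiset;

// Adds one occurrence of key.
inline void
ms_insert(int_multiset& s, int key)
{ ++s[key]; }

// Returns the number of occurrences of key (zero if absent).
inline std::size_t
ms_count(const int_multiset& s, int key)
{
  int_multiset::point_const_iterator it = s.find(key);
  return it == s.end() ? std::size_t(0) : it->second;
}

// Removes one occurrence of key; drops the entry when the count reaches zero.
inline void
ms_erase_one(int_multiset& s, int key)
{
  int_multiset::point_iterator it = s.find(key);
  if (it == s.end())
    return;
  if (--it->second == 0)
    s.erase(key);
}
```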
|
||
<para>
|
||
These "multimaps" and "multisets" might be confusing to
|
||
anyone familiar with the standard's <classname>std::multimap</classname> and
|
||
<classname>std::multiset</classname>, because there is no clear
|
||
correspondence between the two. For example, in some cases
|
||
where one uses <classname>std::multiset</classname> in the standard, one might use
|
||
in this library a "multimap" of "multisets" - i.e., a
|
||
container that maps primary keys each to an associative
|
||
container that maps each secondary key to the number of times
|
||
it occurs.
|
||
</para>
|
||
|
||
<para>
|
||
When one uses a "multimap," one should choose with care the
|
||
type of container used for secondary keys.
|
||
</para>
|
||
</section> <!-- map vs set -->
|
||
|
||
|
||
<section xml:id="concepts.associative_semantics.multi">
|
||
<info><title>Alternatives to <classname>std::multiset</classname> and <classname>std::multimap</classname></title></info>
|
||
|
||
<para>
|
||
Brace oneself: this library does not contain containers like
|
||
<classname>std::multimap</classname> or
|
||
<classname>std::multiset</classname>. Instead, these data
|
||
structures can be synthesized via manipulation of the
|
||
<classname>Mapped</classname> template parameter.
|
||
</para>
|
||
<para>
|
||
One maps the unique part of a key - the primary key, into an
|
||
associative-container of the (originally) non-unique parts of
|
||
the key - the secondary key. A primary associative-container
|
||
is an associative container of primary keys; a secondary
|
||
associative-container is an associative container of
|
||
secondary keys.
|
||
</para>
|
||
|
||
<para>
|
||
Let us step back a bit and start from the beginning.
|
||
</para>
|
||
|
||
|
||
<para>
|
||
Maps (or sets) allow mapping (or storing) unique-key values.
|
||
The standard library also supplies associative containers which
|
||
map (or store) multiple values with equivalent keys:
|
||
<classname>std::multimap</classname>, <classname>std::multiset</classname>,
|
||
<classname>std::tr1::unordered_multimap</classname>, and
|
||
<classname>std::tr1::unordered_multiset</classname>. We first discuss how these might
|
||
be used, then why we think it is best to avoid them.
|
||
</para>
|
||
|
||
<para>
|
||
Suppose one builds a simple bank-account application that
|
||
records for each client (identified by an <classname>std::string</classname>)
|
||
and account-id (marked by an <type>unsigned long</type>) -
|
||
the balance in the account (described by a
|
||
<type>float</type>). Suppose further that ordering this
|
||
information is not useful, so a hash-based container is
|
||
preferable to a tree based container. Then one can use
|
||
</para>
|
||
|
||
<programlisting>
|
||
std::tr1::unordered_map<std::pair<std::string, unsigned long>, float, ...>
|
||
</programlisting>
|
||
|
||
<para>
|
||
which hashes every combination of client and account-id. This
|
||
might work well, except for the fact that it is now impossible
|
||
to efficiently list all of the accounts of a specific client
|
||
(this would practically require iterating over all
|
||
entries). Instead, one can use
|
||
</para>
|
||
|
||
<programlisting>
|
||
std::tr1::unordered_multimap<std::pair<std::string, unsigned long>, float, ...>
|
||
</programlisting>
|
||
|
||
<para>
|
||
which hashes every client, and decides equivalence based on
|
||
client only. This will ensure that all accounts belonging to a
|
||
specific user are stored consecutively.
|
||
</para>
|
||
|
||
<para>
|
||
Also, suppose one wants a priority queue of integers
|
||
(a container that supports <function>push</function>,
|
||
<function>pop</function>, and <function>top</function> operations, the last of which
|
||
returns the largest <type>int</type>) that also supports
|
||
operations such as <function>find</function> and <function>lower_bound</function>. A
|
||
reasonable solution is to build an adapter over
|
||
<classname>std::set<int></classname>. In this adapter,
|
||
<function>push</function> will just call the tree-based
|
||
associative container's <function>insert</function> method; <function>pop</function>
|
||
will call its <function>end</function> method, and use it to return the
|
||
preceding element (which must be the largest). Then this might
|
||
work well, except that the container object cannot hold
|
||
multiple instances of the same integer (<function>push(4)</function>
|
||
will be a no-op if <constant>4</constant> is already in the
|
||
container object). If multiple keys are necessary, then one
|
||
might build the adapter over an
|
||
<classname>std::multiset<int></classname>.
|
||
</para>
|
||
|
||
<para>
|
||
The standard library's non-unique-mapping containers are useful
|
||
when (1) a key can be decomposed into a primary key and a
|
||
secondary key, (2) a key is needed multiple times, or (3) any
|
||
combination of (1) and (2).
|
||
</para>
|
||
|
||
<para>
|
||
The graphic below shows how the standard library's container
|
||
design works internally; in this figure nodes shaded equally
|
||
represent equivalent-key values. Equivalent keys are stored
|
||
consecutively using the properties of the underlying data
|
||
structure: binary search trees (label A) store equivalent-key
|
||
values consecutively (in the sense of an in-order walk)
|
||
naturally; collision-chaining hash tables (label B) store
|
||
equivalent-key values in the same bucket; the bucket can be
|
||
arranged so that equivalent-key values are consecutive.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Non-unique Mapping Standard Containers</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_embedded_lists_1.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Non-unique Mapping Standard Containers</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
Put differently, the standard's non-unique mapping
|
||
associative-containers are associative containers that map
|
||
primary keys to linked lists that are embedded into the
|
||
container. The graphic below shows again the two
|
||
containers from the first graphic above, this time with
|
||
the embedded linked lists of the grayed nodes marked
|
||
explicitly.
|
||
</para>
|
||
|
||
<figure xml:id="fig.pbds_embedded_lists_2">
|
||
<title>
|
||
Effect of embedded lists in
|
||
<classname>std::multimap</classname>
|
||
</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_embedded_lists_2.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>
|
||
Effect of embedded lists in
|
||
<classname>std::multimap</classname>
|
||
</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
These embedded linked lists have several disadvantages.
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
The underlying data structure embeds the linked lists
|
||
according to its own consideration, which means that the
|
||
search path for a value might include several different
|
||
equivalent-key values. For example, the search path for the
black node in either label A or label B of the first graphic
|
||
includes more than a single gray node.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
The links of the linked lists are the underlying data
|
||
structures' nodes, which typically are quite structured. In
|
||
the case of tree-based containers (the graphic above, label
A), each "link" is actually a node with three pointers (one
|
||
to a parent and two to children), and a
|
||
relatively-complicated iteration algorithm. The linked
|
||
lists, therefore, can take up quite a lot of memory, and
|
||
iterating over all values equal to a given key (through the
|
||
return value of the standard
|
||
library's <function>equal_range</function>) can be
|
||
expensive.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
The primary key is stored multiply; this uses more memory.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Finally, the interface of this design excludes several
|
||
useful underlying data structures. Of all the unordered
|
||
self-organizing data structures, practically only
|
||
collision-chaining hash tables can (efficiently) guarantee
|
||
that equivalent-key values are stored consecutively.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
|
||
<para>
|
||
The above reasons hold even when the ratio of secondary keys to
|
||
primary keys (or average number of identical keys) is small, but
|
||
when it is large, there are more severe problems:
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
The underlying data structures order the links inside each
|
||
embedded linked list according to their internal
|
||
considerations, which effectively means that each of the
|
||
links is unordered. Irrespective of the underlying data
|
||
structure, searching for a specific value can degrade to
|
||
linear complexity.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Similarly to the above point, it is impossible to apply
|
||
to the secondary keys considerations that apply to primary
|
||
keys. For example, it is not possible to maintain secondary
|
||
keys by sorted order.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
While the interface "understands" that all equivalent-key
|
||
values constitute a distinct list (through
|
||
<function>equal_range</function>), the underlying data
|
||
structure typically does not. This means that operations such
|
||
as erasing from a tree-based container all values whose keys
|
||
are equivalent to a given key can be super-linear in the
size of the tree; this is also true for several other
|
||
operations that target a specific list.
|
||
</para>
|
||
</listitem>
|
||
|
||
</orderedlist>
|
||
|
||
<para>
|
||
In this library, all associative containers map
|
||
(or store) unique-key values. One can (1) map primary keys to
|
||
secondary associative-containers (containers of
|
||
secondary keys) or non-associative containers, (2) map identical
|
||
keys to a size-type representing the number of times they
|
||
occur, or (3) any combination of (1) and (2). Instead of
|
||
allowing multiple equivalent-key values, this library
|
||
supplies associative containers based on underlying
|
||
data structures that are suitable as secondary
|
||
associative-containers.
|
||
</para>
|
||
|
||
<para>
|
||
In the figure below, labels A and B show the equivalent
|
||
underlying data structures in this library, corresponding to
labels A and B, respectively, of the first graphic above. Each shaded
|
||
box represents some size-type or secondary
|
||
associative-container.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Non-unique Mapping Containers</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_embedded_lists_3.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Non-unique Mapping Containers</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
In the first example above, then, one would use an associative
container mapping each client to an associative container which
maps each account-id to a balance (see
|
||
<filename>example/basic_multimap.cc</filename>); in the second
|
||
example, one would use an associative container mapping
|
||
each <classname>int</classname> to some size-type indicating the
|
||
number of times it logically occurs
|
||
(see <filename>example/basic_multiset.cc</filename>).
|
||
</para>
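<para>
For the bank-account scenario, a primary container of clients mapping
to secondary containers of accounts might look roughly like the sketch
below. The typedef and function names are hypothetical, and the sketch
assumes that the library's default hash policy accepts
<classname>std::string</classname> keys. Listing the accounts of one
client then touches only that client's secondary container.
</para>

```cpp
#include <string>
#include <ext/pb_ds/assoc_container.hpp>

// Secondary container: account-id -> balance.
typedef __gnu_pbds::cc_hash_table<unsigned long, float> account_map;

// Primary container: client -> that client's accounts.
typedef __gnu_pbds::cc_hash_table<std::string, account_map> client_map;

// Records (or overwrites) the balance of one account of one client.
inline void
set_balance(client_map& bank, const std::string& client,
            unsigned long account, float balance)
{ bank[client][account] = balance; }

// Sums the balances of a single client without visiting other clients.
inline float
total_balance(const client_map& bank, const std::string& client)
{
  client_map::point_const_iterator c = bank.find(client);
  if (c == bank.end())
    return 0.0f;

  float total = 0.0f;
  for (account_map::const_iterator a = c->second.begin();
       a != c->second.end(); ++a)
    total += a->second;
  return total;
}
```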
|
||
|
||
<para>
|
||
See the discussion in list-based container types for containers
|
||
especially suited as secondary associative-containers.
|
||
</para>
|
||
</section>
|
||
|
||
</section> <!-- map and set semantics -->
|
||
|
||
<section xml:id="pbds.design.concepts.iterator_semantics">
|
||
<info><title>Iterator Semantics</title></info>
|
||
|
||
<section xml:id="concepts.iterator_semantics.point_and_range">
|
||
<info><title>Point and Range Iterators</title></info>
|
||
|
||
<para>
|
||
Iterator concepts are bifurcated in this design, and are
|
||
comprised of point-type and range-type iteration.
|
||
</para>
|
||
|
||
<para>
|
||
A point-type iterator is an iterator that refers to a specific
|
||
element as returned through an
|
||
associative-container's <function>find</function> method.
|
||
</para>
|
||
|
||
<para>
|
||
A range-type iterator is an iterator that is used to go over a
|
||
sequence of elements, as returned by a container's
|
||
<function>begin</function> method.
|
||
</para>
|
||
|
||
<para>
|
||
A point-type method is a method that
|
||
returns a point-type iterator; a range-type method is a method
|
||
that returns a range-type iterator.
|
||
</para>
|
||
|
||
<para>For most containers, these types are synonymous; for
|
||
self-organizing containers, such as hash-based containers or
|
||
priority queues, these are inherently different (in any
|
||
implementation, including that of C++ standard library
|
||
components), but in this design, it is made explicit. They are
|
||
distinct types.
|
||
</para>
|
||
</section>
|
||
|
||
|
||
<section xml:id="concepts.iterator_semantics.both">
|
||
<info><title>Distinguishing Point and Range Iterators</title></info>
|
||
|
||
<para>When using this library, it is necessary to differentiate
|
||
between two types of methods and iterators: point-type methods and
|
||
iterators, and range-type methods and iterators. Each associative
|
||
container's interface includes the methods:</para>
|
||
<programlisting>
|
||
point_const_iterator
|
||
find(const_key_reference r_key) const;
|
||
|
||
point_iterator
|
||
find(const_key_reference r_key);
|
||
|
||
std::pair<point_iterator,bool>
|
||
insert(const_reference r_val);
|
||
</programlisting>
|
||
|
||
<para>The relationship between these iterator types varies between
|
||
container types. The figure below
|
||
shows the most general invariant between point-type and
|
||
range-type iterators: in <emphasis>A</emphasis>, <literal>iterator</literal> can
always be converted to <literal>point_iterator</literal>. <emphasis>B</emphasis>
shows invariants for order-preserving containers: point-type
iterators are synonymous with range-type iterators.
Orthogonally, <emphasis>C</emphasis> shows invariants for "set"
|
||
containers: iterators are synonymous with const iterators.</para>
|
||
|
||
<figure>
|
||
<title>Point Iterator Hierarchy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_point_iterator_hierarchy.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Point Iterator Hierarchy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
<para>Note that point-type iterators in self-organizing containers
|
||
(hash-based associative containers) lack movement
|
||
operators, such as <literal>operator++</literal> - in fact, this
|
||
is the reason why this library differs from the standard C++ library's
|
||
design on this point.</para>
|
||
|
||
<para>Typically, one can determine an iterator's movement
|
||
capabilities using
|
||
<literal>std::iterator_traits<It>::iterator_category</literal>,
which is a <literal>struct</literal> indicating the iterator's
movement capabilities. Unfortunately, none of the standard predefined
categories reflect an iterator's <emphasis>not</emphasis> having any
movement capabilities whatsoever. Consequently,
<literal>pb_ds</literal> adds a type
<literal>trivial_iterator_tag</literal> (whose name is taken from
a concept in C++ standardese), which is the category of iterators
with no movement capabilities. All other standard C++ library
tags, such as <literal>forward_iterator_tag</literal>, retain their
|
||
common use.</para>
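<para>
As a rough illustration, the category of a container's point-type
iterator can be inspected at compile time. The sketch assumes a
C++11 compiler (for <literal>std::is_same</literal>) and that
<literal>trivial_iterator_tag</literal> is declared in
<filename>ext/pb_ds/tag_and_trait.hpp</filename>; the function name is
hypothetical.
</para>

```cpp
#include <type_traits>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// Hypothetical helper: true if Cntnr's point-type iterator has no movement
// capabilities, i.e. its category is trivial_iterator_tag.
template<typename Cntnr>
bool
point_iterator_is_trivial()
{
  typedef typename Cntnr::point_iterator::iterator_category category;
  return std::is_same<category, __gnu_pbds::trivial_iterator_tag>::value;
}
```

<para>
Under these assumptions, the function is expected to return true for a
hash-based container and false for a tree-based one, whose point-type
iterators are also range-type iterators.
</para>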
|
||
|
||
</section>
|
||
|
||
<section xml:id="pbds.design.concepts.invalidation">
|
||
<info><title>Invalidation Guarantees</title></info>
|
||
<para>
|
||
If one manipulates a container object, then iterators previously
|
||
obtained from it can be invalidated. In some cases a
|
||
previously-obtained iterator cannot be de-referenced; in other cases,
|
||
the iterator's next or previous element might have changed
|
||
unpredictably. This corresponds exactly to the question whether a
|
||
point-type or range-type iterator (see previous concept) is valid or
|
||
not. In this design, one can query a container (at compile time) about
|
||
its invalidation guarantees.
|
||
</para>
|
||
|
||
|
||
<para>
|
||
Given three different types of associative containers, a modifying
|
||
operation (in that example, <function>erase</function>) invalidated
|
||
iterators in three different ways: the iterator of one container
|
||
remained completely valid - it could be de-referenced and
|
||
incremented; the iterator of a different container could not even be
|
||
de-referenced; the iterator of the third container could be
|
||
de-referenced, but its "next" iterator changed unpredictably.
|
||
</para>
|
||
|
||
<para>
|
||
Distinguishing between point and range types allows fine-grained
|
||
invalidation guarantees, because these questions correspond exactly
|
||
to the question of whether point-type iterators and range-type
|
||
iterators are valid. The graphic below shows tags corresponding to
|
||
different types of invalidation guarantees.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Invalidation Guarantee Tags Hierarchy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PDF" scale="75"
|
||
fileref="../images/pbds_invalidation_tag_hierarchy.pdf"/>
|
||
</imageobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_invalidation_tag_hierarchy.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Invalidation Guarantee Tags Hierarchy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>basic_invalidation_guarantee</classname>
|
||
corresponds to a basic guarantee that a point-type iterator,
|
||
a found pointer, or a found reference, remains valid as long
|
||
as the container object is not modified.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>point_invalidation_guarantee</classname>
|
||
corresponds to a guarantee that a point-type iterator, a
|
||
found pointer, or a found reference, remains valid even if
|
||
the container object is modified.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>range_invalidation_guarantee</classname>
|
||
corresponds to a guarantee that a range-type iterator remains
|
||
valid even if the container object is modified.
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
<para>To find the invalidation guarantee of a
|
||
container, one can use</para>
|
||
<programlisting>
|
||
typename container_traits<Cntnr>::invalidation_guarantee
|
||
</programlisting>
|
||
|
||
<para>Note that this hierarchy corresponds to the logic it
|
||
represents: if a container has range-invalidation guarantees,
|
||
then it must also have point-invalidation guarantees;
|
||
correspondingly, its invalidation guarantee (in this case
|
||
<classname>range_invalidation_guarantee</classname>)
|
||
can be cast to its base class (in this case <classname>point_invalidation_guarantee</classname>).
|
||
This means that this hierarchy can be used easily with
|
||
standard metaprogramming techniques, by specializing on the
|
||
type of <literal>invalidation_guarantee</literal>.</para>
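<para>As a minimal sketch of such tag-based specialization, the
following dispatches on the guarantee tag through overloading. Every
name inside the <literal>sketch</literal> namespace is a hypothetical
stand-in that only mirrors the hierarchy described above; it is not the
library's own definition.</para>

```cpp
namespace sketch {

// Stand-in tags mirroring the hierarchy: each stronger guarantee
// derives from the weaker one.
struct basic_invalidation_guarantee {};
struct point_invalidation_guarantee : basic_invalidation_guarantee {};
struct range_invalidation_guarantee : point_invalidation_guarantee {};

// Hypothetical containers, each advertising one guarantee.
struct probing_table  { typedef basic_invalidation_guarantee invalidation_guarantee; };
struct chaining_table { typedef point_invalidation_guarantee invalidation_guarantee; };
struct node_tree      { typedef range_invalidation_guarantee invalidation_guarantee; };

// Stand-in traits class that forwards the container's tag.
template<typename Cntnr>
struct container_traits
{
  typedef typename Cntnr::invalidation_guarantee invalidation_guarantee;
};

// Overloads selected by the tag. A derived tag converts to its nearest
// base, so a range tag selects the point overload.
constexpr bool found_pointer_survives(basic_invalidation_guarantee) { return false; }
constexpr bool found_pointer_survives(point_invalidation_guarantee) { return true; }

// True if a pointer obtained from find may be kept across modifications.
template<typename Cntnr>
constexpr bool can_cache_found_pointer()
{
  return found_pointer_survives(
      typename container_traits<Cntnr>::invalidation_guarantee());
}

} // namespace sketch
```

<para>Because the range tag derives from the point tag, generic code
written against the point guarantee accepts containers with the
stronger guarantee without any extra overload.</para>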
|
||
|
||
<para>
|
||
These types of problems were addressed, in a more general
|
||
setting, in <xref linkend="biblio.meyers96more"/> - Item 2. In
|
||
our opinion, an invalidation-guarantee hierarchy would solve
|
||
these problems in all container types - not just associative
|
||
containers.
|
||
</para>
|
||
|
||
</section>
|
||
</section> <!-- iterator semantics -->
|
||
|
||
<section xml:id="pbds.design.concepts.genericity">
|
||
<info><title>Genericity</title></info>
|
||
|
||
<para>
|
||
The design attempts to address the following problem of
|
||
data-structure genericity. When writing a function manipulating
|
||
a generic container object, what is the behavior of the object?
|
||
Suppose one writes
|
||
</para>
|
||
<programlisting>
|
||
template<typename Cntnr>
|
||
void
|
||
some_op_sequence(Cntnr &r_container)
|
||
{
|
||
...
|
||
}
|
||
</programlisting>
|
||
|
||
<para>
|
||
then one needs to address the following questions in the body
|
||
of <function>some_op_sequence</function>:
|
||
</para>
|
||
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>
|
||
Which types and methods does <literal>Cntnr</literal> support?
|
||
Containers based on hash tables can be queried for the
|
||
hash-functor type and object; this is meaningless for tree-based
|
||
containers. Containers based on trees can be split, joined, or
|
||
can erase iterators and return the following iterator; this
|
||
cannot be done by hash-based containers.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
What are the exception and invalidation guarantees
|
||
of <literal>Cntnr</literal>? A container based on a probing
|
||
hash-table invalidates all iterators when it is modified; this
|
||
is not the case for containers based on node-based
|
||
trees. Containers based on a node-based tree can be split or
|
||
joined without exceptions; this is not the case for containers
|
||
based on vector-based trees.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
How does the container maintain its elements? Tree-based and
|
||
trie-based containers store elements by key order; others,
typically, do not. Containers based on splay trees or lists
|
||
with update policies "cache" "frequently accessed" elements;
|
||
containers based on most other underlying data structures do
|
||
not.
|
||
</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>
|
||
How does one query a container about characteristics and
|
||
capabilities? What is the relationship between two different
|
||
data structures, if any?
|
||
</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
<para>The remainder of this section explains these issues in
|
||
detail.</para>
|
||
|
||
|
||
<section xml:id="concepts.genericity.tag">
|
||
<info><title>Tag</title></info>
|
||
<para>
|
||
Tags are very useful for manipulating generic types. For example, if
|
||
<literal>It</literal> is an iterator class, then <literal>typename
|
||
It::iterator_category</literal> or <literal>typename
|
||
std::iterator_traits<It>::iterator_category</literal> will
|
||
yield its category, and <literal>typename
|
||
std::iterator_traits<It>::value_type</literal> will yield its
|
||
value type.
|
||
</para>
|
||
|
||
<para>
|
||
This library contains a container tag hierarchy corresponding to the
|
||
diagram below.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Container Tag Hierarchy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PDF" scale="75"
|
||
fileref="../images/pbds_container_tag_hierarchy.pdf"/>
|
||
</imageobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_container_tag_hierarchy.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Container Tag Hierarchy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>
|
||
Given any container <type>Cntnr</type>, the tag of
|
||
the underlying data structure can be found via <literal>typename
|
||
Cntnr::container_category</literal>.
|
||
</para>
|
||
|
||
</section> <!-- tag -->
|
||
|
||
<section xml:id="concepts.genericity.traits">
|
||
<info><title>Traits</title></info>
|
||
|
||
|
||
<para>Additionally, a traits mechanism can be used to query a
|
||
container type for its attributes. Given any container
|
||
<literal>Cntnr</literal>, then <literal>container_traits<Cntnr></literal>
|
||
is a traits class identifying the properties of the
|
||
container.</para>
|
||
|
||
<para>To find if a container can throw when a key is erased (which
|
||
is true for vector-based trees, for example), one can
|
||
use
|
||
</para>
|
||
<programlisting>container_traits<Cntnr>::erase_can_throw</programlisting>
|
||
|
||
<para>
|
||
Some of the definitions in <classname>container_traits</classname>
|
||
are dependent on other
|
||
definitions. If <classname>container_traits<Cntnr>::order_preserving</classname>
|
||
is <constant>true</constant> (which is the case for containers
|
||
based on trees and tries), then the container can be split or
|
||
joined; in this
|
||
case, <classname>container_traits<Cntnr>::split_join_can_throw</classname>
|
||
indicates whether splits or joins can throw exceptions (which is
|
||
true for vector-based trees);
|
||
otherwise <classname>container_traits<Cntnr>::split_join_can_throw</classname>
|
||
will yield a compilation error. (This is somewhat similar to a
|
||
compile-time version of the COM model).
|
||
</para>
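<para>The dependency between trait definitions can be sketched as
follows. The tags, the two container types, and the detection helper
<literal>has_split_join_can_throw</literal> are hypothetical names
introduced for illustration; only the idea (a member that exists solely
for order-preserving containers) comes from the text above.</para>

```cpp
#include <type_traits>

namespace sketch {

struct vector_tree_tag {};  // hypothetical tag: order-preserving
struct hash_table_tag {};   // hypothetical tag: not order-preserving

template<typename Tag> struct traits_for;  // primary template left undefined

template<> struct traits_for<vector_tree_tag>
{
  static constexpr bool order_preserving = true;
  static constexpr bool erase_can_throw = true;
  static constexpr bool split_join_can_throw = true;
};

template<> struct traits_for<hash_table_tag>
{
  static constexpr bool order_preserving = false;
  static constexpr bool erase_can_throw = false;
  // No split_join_can_throw here: naming it is a compile-time error.
};

struct tree_map { typedef vector_tree_tag container_category; };
struct hash_map { typedef hash_table_tag container_category; };

// Stand-in traits class keyed on the container's tag.
template<typename Cntnr>
struct container_traits : traits_for<typename Cntnr::container_category> {};

// Detects whether Traits defines split_join_can_throw.
template<typename Traits, typename = void>
struct has_split_join_can_throw : std::false_type {};

template<typename Traits>
struct has_split_join_can_throw<
    Traits, decltype(void(Traits::split_join_can_throw))> : std::true_type {};

} // namespace sketch
```

<para>Generic code can test <literal>order_preserving</literal> first
and only then name <literal>split_join_can_throw</literal>, which keeps
the query well-formed for every container type.</para>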
|
||
|
||
</section> <!-- traits -->
|
||
|
||
</section> <!-- genericity -->
|
||
</section> <!-- concepts -->
|
||
|
||
<section xml:id="pbds.design.container">
|
||
<info><title>By Container</title></info>
|
||
|
||
<!-- hash -->
|
||
<section xml:id="pbds.design.container.hash">
|
||
<info><title>hash</title></info>
|
||
|
||
<!--
|
||
|
||
// hash policies
|
||
/// general terms / background
|
||
/// range hashing policies
|
||
/// ranged-hash policies
|
||
/// implementation
|
||
|
||
// resize policies
|
||
/// general
|
||
/// size policies
|
||
/// trigger policies
|
||
/// implementation
|
||
|
||
// policy interactions
|
||
/// probe/size/trigger
|
||
/// hash/trigger
|
||
/// eq/hash/storing hash values
|
||
/// size/load-check trigger
|
||
-->
|
||
<section xml:id="container.hash.interface">
|
||
<info><title>Interface</title></info>
|
||
|
||
|
||
|
||
<para>
|
||
The collision-chaining hash-based container has the
|
||
following declaration.</para>
|
||
<programlisting>
template<
  typename Key,
  typename Mapped,
  typename Hash_Fn = std::hash<Key>,
  typename Eq_Fn = std::equal_to<Key>,
  typename Comb_Hash_Fn = direct_mask_range_hashing<>,
  typename Resize_Policy = /* default explained below */,
  bool Store_Hash = false,
  typename Allocator = std::allocator<char> >
class cc_hash_table;
</programlisting>
|
||
|
||
<para>The parameters have the following meaning:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para><classname>Key</classname> is the key type.</para></listitem>
|
||
|
||
<listitem><para><classname>Mapped</classname> is the mapped-policy.</para></listitem>
|
||
|
||
<listitem><para><classname>Hash_Fn</classname> is a key hashing functor.</para></listitem>
|
||
|
||
<listitem><para><classname>Eq_Fn</classname> is a key equivalence functor.</para></listitem>
|
||
|
||
<listitem><para><classname>Comb_Hash_Fn</classname> is a range-hashing functor;
|
||
it describes how to translate hash values into positions
|
||
within the table. </para></listitem>
|
||
|
||
<listitem><para><classname>Resize_Policy</classname> describes how a container object
|
||
should change its internal size. </para></listitem>
|
||
|
||
<listitem><para><classname>Store_Hash</classname> indicates whether the hash value
|
||
should be stored with each entry. </para></listitem>
|
||
|
||
<listitem><para><classname>Allocator</classname> is an allocator
|
||
type.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The probing hash-based container has the following
|
||
declaration.</para>
|
||
<programlisting>
template<
  typename Key,
  typename Mapped,
  typename Hash_Fn = std::hash<Key>,
  typename Eq_Fn = std::equal_to<Key>,
  typename Comb_Probe_Fn = direct_mask_range_hashing<>,
  typename Probe_Fn = /* default explained below */,
  typename Resize_Policy = /* default explained below */,
  bool Store_Hash = false,
  typename Allocator = std::allocator<char> >
class gp_hash_table;
</programlisting>
|
||
|
||
<para>The parameters are identical to those of the
|
||
collision-chaining container, except for the following.</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para><classname>Comb_Probe_Fn</classname> describes how to transform a probe
|
||
sequence into a sequence of positions within the table.</para></listitem>
|
||
|
||
<listitem><para><classname>Probe_Fn</classname> describes a probe sequence policy.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>Some of the default template values depend on the values of
|
||
other parameters, and are explained below.</para>
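<para>A short usage sketch follows, assuming a libstdc++ installation
that ships the header ext/pb_ds/assoc_container.hpp and places the
containers in namespace <literal>__gnu_pbds</literal>. The typedef and
function names are hypothetical; the first table uses all default
policies and the second overrides only the combining functor.</para>

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Collision-chaining table with every policy left at its default.
typedef __gnu_pbds::cc_hash_table<std::string, int> chained_map;

// Probing table that swaps the default mask-based combining functor
// for the modulo-based one; remaining policies follow from that choice.
typedef __gnu_pbds::gp_hash_table<
    int, int,
    std::hash<int>,
    std::equal_to<int>,
    __gnu_pbds::direct_mod_range_hashing<> > probed_map;

// Counts three words; "tree" appears twice and two keys are stored.
inline int count_words_demo()
{
  chained_map counts;
  ++counts["tree"];
  ++counts["hash"];
  ++counts["tree"];
  return counts["tree"] * 10 + static_cast<int>(counts.size());
}

// Stores i -> i * i and looks one key up again.
inline bool probed_lookup_demo()
{
  probed_map squares;
  for (int i = 0; i < 100; ++i)
    squares[i] = i * i;
  return squares.find(7) != squares.end() && squares.find(7)->second == 49;
}
```

<para>Both tables expose the familiar map interface, so the policy
choice is confined to the typedef.</para>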
|
||
|
||
</section>
|
||
<section xml:id="container.hash.details">
|
||
<info><title>Details</title></info>
|
||
|
||
<section xml:id="container.hash.details.hash_policies">
|
||
<info><title>Hash Policies</title></info>
|
||
|
||
<section xml:id="details.hash_policies.general">
|
||
<info><title>General</title></info>
|
||
|
||
<para>Following is an explanation of some functions which hashing
|
||
involves. The graphic below illustrates the discussion.</para>
|
||
|
||
<figure>
|
||
<title>Hash functions, ranged-hash functions, and
|
||
range-hashing functions</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_hash_ranged_hash_range_hashing_fns.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Hash functions, ranged-hash functions, and
|
||
range-hashing functions</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>Let U be a domain (e.g., the integers, or the
|
||
strings of 3 characters). A hash-table algorithm needs to map
|
||
elements of U "uniformly" into the range [0,..., m -
|
||
1] (where m is a non-negative integral value, and
|
||
is, in general, time varying). I.e., the algorithm needs
|
||
a ranged-hash function</para>
|
||
|
||
<para>
|
||
f : U × Z<subscript>+</subscript> → Z<subscript>+</subscript>
|
||
</para>
|
||
|
||
<para>such that for any u in U ,</para>
|
||
|
||
<para>0 ≤ f(u, m) ≤ m - 1</para>
|
||
|
||
<para>and which has "good uniformity" properties (see
<xref linkend="biblio.knuth98sorting"/>). One common solution is to
use the composition of the hash function</para>
|
||
|
||
<para>h : U → Z<subscript>+</subscript> ,</para>
|
||
|
||
<para>which maps elements of U into the non-negative
|
||
integrals, and</para>
|
||
|
||
<para>g : Z<subscript>+</subscript> × Z<subscript>+</subscript> →
|
||
Z<subscript>+</subscript>,</para>
|
||
|
||
<para>which maps a non-negative hash value, and a non-negative
|
||
range upper-bound into a non-negative integral in the range
|
||
between 0 (inclusive) and the range upper bound (exclusive),
|
||
i.e., for any r in Z<subscript>+</subscript>,</para>
|
||
|
||
<para>0 ≤ g(r, m) ≤ m - 1</para>
|
||
|
||
|
||
<para>The resulting ranged-hash function is</para>
|
||
|
||
<!-- ranged_hash_composed_of_hash_and_range_hashing -->
|
||
<equation>
|
||
<title>Ranged Hash Function</title>
|
||
<mathphrase>
|
||
f(u , m) = g(h(u), m)
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
<para>From the above, it is obvious that given g and
|
||
h, f can always be composed (however the converse
|
||
is not true). The standard's hash-based containers allow specifying
|
||
a hash function, and use a hard-wired range-hashing function;
|
||
the ranged-hash function is implicitly composed.</para>
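<para>The composition f(u, m) = g(h(u), m) can be sketched with two
small functors. All three class names are hypothetical, and the
identity hash is chosen only so that the results are easy to check by
hand.</para>

```cpp
#include <cstddef>

// h: maps a key to a non-negative integer (identity, for illustration).
struct identity_hash
{
  constexpr std::size_t operator()(unsigned key) const { return key; }
};

// g: maps a hash value r and a table size m (m >= 1) into [0, m).
struct mod_range_hashing
{
  constexpr std::size_t operator()(std::size_t r, std::size_t m) const
  { return r % m; }
};

// f: the ranged-hash function obtained by composing h and g.
template<typename H, typename G>
struct composed_ranged_hash
{
  H h;
  G g;
  constexpr std::size_t operator()(unsigned u, std::size_t m) const
  { return g(h(u), m); }
};
```

<para>Replacing either functor changes f without touching the other,
which is the separation the policy parameters rely on.</para>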
|
||
|
||
<para>The above describes the case where a key is to be mapped
|
||
into a single position within a hash table, e.g.,
|
||
in a collision-chaining table. In other cases, a key is to be
|
||
mapped into a sequence of positions within a table,
|
||
e.g., in a probing table. Similar terms apply in this
|
||
case: the table requires a ranged probe function,
|
||
mapping a key into a sequence of positions within the table.
|
||
This is typically achieved by composing a hash function
|
||
mapping the key into a non-negative integral type, a
|
||
probe function transforming the hash value into a
|
||
sequence of hash values, and a range-hashing function
|
||
transforming the sequence of hash values into a sequence of
|
||
positions.</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="details.hash_policies.range">
|
||
<info><title>Range Hashing</title></info>
|
||
|
||
<para>Some common choices for range-hashing functions are the
|
||
division, multiplication, and middle-square methods (<xref linkend="biblio.knuth98sorting"/>), defined
|
||
as</para>
|
||
|
||
<equation>
|
||
<title>Range-Hashing, Division Method</title>
|
||
<mathphrase>
|
||
g(r, m) = r mod m
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
|
||
|
||
<para>g(r, m) = ⌈ u/v ( a r mod v ) ⌉</para>
|
||
|
||
<para>and</para>
|
||
|
||
<para>g(r, m) = ⌈ u/v ( r<superscript>2</superscript> mod v ) ⌉</para>
|
||
|
||
<para>respectively, for some positive integrals u and
|
||
v (typically powers of 2), and some a. Each of
|
||
these range-hashing functions works best for some different
|
||
setting.</para>
|
||
|
||
<para>The division method (see above) is a
|
||
very common choice. However, even this single method can be
|
||
implemented in two very different ways. It is possible to
|
||
implement using the low
|
||
level % (modulo) operation (for any m), or the
|
||
low level & (bit-mask) operation (for the case where
|
||
m is a power of 2), i.e.,</para>
|
||
|
||
<equation>
|
||
<title>Division via Prime Modulo</title>
|
||
<mathphrase>
|
||
g(r, m) = r % m
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
<para>and</para>
|
||
|
||
<equation>
|
||
<title>Division via Bit Mask</title>
|
||
<mathphrase>
|
||
g(r, m) = r & (m - 1), (with m =
|
||
2<superscript>k</superscript> for some k)
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
|
||
<para>respectively.</para>
|
||
|
||
<para>The % (modulo) implementation has the advantage that for
|
||
m a prime far from a power of 2, g(r, m) is
|
||
affected by all the bits of r (minimizing the chance of
|
||
collision). It has the disadvantage of using the costly modulo
|
||
operation. This method is hard-wired into SGI's implementation.</para>
|
||
|
||
<para>The & (bit-mask) implementation has the advantage of
|
||
relying on the fast bit-wise AND operation. It has the
disadvantage that g(r, m) is affected only by the
|
||
low order bits of r. This method is hard-wired into
|
||
Dinkumware's implementation.</para>
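<para>The trade-off between the two implementations can be seen in a
small sketch. The function names are hypothetical; the hash values
0x100, 0x200 and 0x300 differ only in bits above the low four.</para>

```cpp
#include <cstddef>

// Division method via modulo; valid for any table size m >= 1.
constexpr std::size_t mod_range_hash(std::size_t r, std::size_t m)
{ return r % m; }

// Division method via bit mask; valid only when m is a power of 2.
constexpr std::size_t mask_range_hash(std::size_t r, std::size_t m)
{ return r & (m - 1); }

// With m = 16 the mask keeps only the low four bits, so all three
// hash values land in position 0.
static_assert(mask_range_hash(0x100, 16) == 0, "");
static_assert(mask_range_hash(0x200, 16) == 0, "");
static_assert(mask_range_hash(0x300, 16) == 0, "");

// With the prime m = 13 every bit contributes, and the positions differ.
static_assert(mod_range_hash(0x100, 13) == 9, "");
static_assert(mod_range_hash(0x200, 13) == 5, "");
static_assert(mod_range_hash(0x300, 13) == 1, "");
```

<para>A mask-based table therefore depends on the hash functor to mix
high-order bits into the low-order ones.</para>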
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="details.hash_policies.ranged">
|
||
<info><title>Ranged Hash</title></info>
|
||
|
||
<para>In some cases it is beneficial to allow the
client to directly specify a ranged-hash function. It is
true that the writer of the ranged-hash function cannot rely
on the values of m having specific numerical properties
suitable for hashing (in the sense used in <xref linkend="biblio.knuth98sorting"/>), since
the values of m are determined by a resize policy with
possibly orthogonal considerations.</para>
|
||
|
||
<para>There are two cases where a ranged-hash function can be
superior. The first is when using perfect hashing; the
second is when the values of m can be used to estimate
the "general" number of distinct values required. The latter is
described in the following.</para>
|
||
|
||
<para>Let</para>
|
||
|
||
<para>
|
||
s = [ s<subscript>0</subscript>,..., s<subscript>t - 1</subscript>]
|
||
</para>
|
||
|
||
<para>be a string of t characters, each of which is from
|
||
domain S. Consider the following ranged-hash
|
||
function:</para>
|
||
<equation>
|
||
<title>
|
||
A Standard String Hash Function
|
||
</title>
|
||
<mathphrase>
|
||
f<subscript>1</subscript>(s, m) = ∑ <subscript>i =
|
||
0</subscript><superscript>t - 1</superscript> s<subscript>i</subscript> a<superscript>i</superscript> mod m
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
|
||
<para>where a is some non-negative integral value. This is
|
||
the standard string-hashing function used in SGI's
|
||
implementation (with a = 5). Its advantage is that
|
||
it takes into account all of the characters of the string.</para>
|
||
|
||
<para>Now assume that s is the string representation of
a long DNA sequence (and so S = {'A', 'C', 'G',
|
||
'T'}). In this case, scanning the entire string might be
|
||
prohibitively expensive. A possible alternative might be to use
|
||
only the first k characters of the string, where</para>
|
||
|
||
<para>|S|<superscript>k</superscript> ≥ m ,</para>
|
||
|
||
<para>i.e., using the hash function</para>
|
||
|
||
<equation>
|
||
<title>
|
||
Only k String DNA Hash
|
||
</title>
|
||
<mathphrase>
|
||
f<subscript>2</subscript>(s, m) = ∑ <subscript>i
|
||
= 0</subscript><superscript>k - 1</superscript> s<subscript>i</subscript> a<superscript>i</superscript> mod m
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
<para>requiring scanning over only</para>
|
||
|
||
<para>k = log<subscript>4</subscript>( m )</para>
|
||
|
||
<para>characters.</para>
|
||
|
||
<para>Other more elaborate hash-functions might scan k
|
||
characters starting at a random position (determined at each
|
||
resize), or scan k random positions (determined at
|
||
each resize), i.e., using</para>
|
||
|
||
<para>f<subscript>3</subscript>(s, m) = ∑ <subscript>i =
r<subscript>0</subscript></subscript><superscript>r<subscript>0</subscript> + k - 1</superscript> s<subscript>i</subscript>
a<superscript>i</superscript> mod m ,</para>
|
||
|
||
<para>or</para>
|
||
|
||
<para>f<subscript>4</subscript>(s, m) = ∑ <subscript>i = 0</subscript><superscript>k -
1</superscript> s<subscript>r<subscript>i</subscript></subscript> a<superscript>r<subscript>i</subscript></superscript> mod
m ,</para>
|
||
|
||
<para>respectively, for r<subscript>0</subscript>,..., r<subscript>k-1</subscript>
|
||
each in the (inclusive) range [0,...,t-1].</para>
|
||
|
||
<para>It should be noted that the above functions cannot be
decomposed into a hash function and a range-hashing function, as in the
ranged-hash composition f(u, m) = g(h(u), m).</para>
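<para>A minimal sketch of the prefix-only DNA hash (the function called
f<subscript>2</subscript> above) is given below. The function names are
hypothetical, a = 5 is taken from the text, and the table size m is
assumed to be at least 1 and small enough that m squared fits in
<literal>std::size_t</literal>.</para>

```cpp
#include <cstddef>

// Smallest k such that 4^k >= m, i.e. the number of DNA characters
// needed to distinguish m table positions.
constexpr std::size_t prefix_length(std::size_t m)
{
  std::size_t k = 0;
  std::size_t reach = 1;
  while (reach < m)
    {
      reach *= 4;
      ++k;
    }
  return k;
}

// Ranged hash over the first k characters of s (length t):
// (sum of s[i] * a^i for i in [0, k)) mod m, reduced at every step.
constexpr std::size_t dna_prefix_hash(const char* s, std::size_t t,
                                      std::size_t m, std::size_t a = 5)
{
  const std::size_t wanted = prefix_length(m);
  const std::size_t k = wanted < t ? wanted : t;
  std::size_t sum = 0;
  std::size_t power = 1 % m;
  for (std::size_t i = 0; i < k; ++i)
    {
      sum = (sum + (static_cast<unsigned char>(s[i]) % m) * power) % m;
      power = (power * a) % m;
    }
  return sum;
}
```

<para>The number of scanned characters depends on m, so the function
needs the table size as an input and does not fit the separate
hash-functor slot.</para>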
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="details.hash_policies.implementation">
|
||
<info><title>Implementation</title></info>
|
||
|
||
<para>This sub-subsection describes the implementation of
|
||
the above in this library. It first explains range-hashing
|
||
functions in collision-chaining tables, then ranged-hash
|
||
functions in collision-chaining tables, then probing-based
|
||
tables, and finally lists the relevant classes in this
|
||
library.</para>
|
||
|
||
<section xml:id="hash_policies.implementation.collision-chaining">
|
||
<info><title>
|
||
Range-Hashing and Ranged-Hashes in Collision-Chaining Tables
|
||
</title></info>
|
||
|
||
|
||
<para><classname>cc_hash_table</classname> is
|
||
parametrized by <classname>Hash_Fn</classname> and <classname>Comb_Hash_Fn</classname>, a
|
||
hash functor and a combining hash functor, respectively.</para>
|
||
|
||
<para>In general, <classname>Comb_Hash_Fn</classname> is considered a
|
||
range-hashing functor. <classname>cc_hash_table</classname>
|
||
synthesizes a ranged-hash function from <classname>Hash_Fn</classname> and
|
||
<classname>Comb_Hash_Fn</classname>. The figure below shows an <function>insert</function> sequence
|
||
diagram for this case. The user inserts an element (point A),
|
||
the container transforms the key into a non-negative integral
|
||
using the hash functor (points B and C), and transforms the
|
||
result into a position using the combining functor (points D
|
||
and E).</para>
|
||
|
||
<figure>
|
||
<title>Insert hash sequence diagram</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_hash_range_hashing_seq_diagram.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Insert hash sequence diagram</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>If <classname>cc_hash_table</classname>'s
|
||
hash-functor, <classname>Hash_Fn</classname>, is instantiated by <classname>null_type</classname>, then <classname>Comb_Hash_Fn</classname> is taken to be
|
||
a ranged-hash function. The graphic below shows an <function>insert</function> sequence
|
||
diagram. The user inserts an element (point A), the container
|
||
transforms the key into a position using the combining functor
|
||
(points B and C).</para>
|
||
|
||
<figure>
|
||
<title>Insert hash sequence diagram with a null policy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_hash_range_hashing_seq_diagram2.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Insert hash sequence diagram with a null policy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
</section>
|
||
|
||
<section xml:id="hash_policies.implementation.probe">
|
||
<info><title>
|
||
Probing tables
|
||
</title></info>
|
||
<para><classname>gp_hash_table</classname> is parametrized by
|
||
<classname>Hash_Fn</classname>, <classname>Probe_Fn</classname>,
|
||
and <classname>Comb_Probe_Fn</classname>. As before, if
|
||
<classname>Hash_Fn</classname> and <classname>Probe_Fn</classname>
|
||
are both <classname>null_type</classname>, then
|
||
<classname>Comb_Probe_Fn</classname> is a ranged-probe
|
||
functor. Otherwise, <classname>Hash_Fn</classname> is a hash
|
||
functor, <classname>Probe_Fn</classname> is a functor for offsets
|
||
from a hash value, and <classname>Comb_Probe_Fn</classname>
|
||
transforms a probe sequence into a sequence of positions within
|
||
the table.</para>
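<para>The three roles can be sketched as plain functions: a probe
function yields the i-th offset from the hash value, and a combining
function turns the offset hash value into a table position. The names
below are hypothetical, and a mask-based combining step with a
power-of-2 table size is assumed.</para>

```cpp
#include <cstddef>

// Probe functions: offset of the i-th probe from the hash value.
constexpr std::size_t linear_offset(std::size_t i) { return i; }
constexpr std::size_t quadratic_offset(std::size_t i) { return i * i; }

// Combining step: maps (hash + offset) to a position in a table of
// size m, where m is a power of 2.
constexpr std::size_t probe_position(std::size_t hash, std::size_t offset,
                                     std::size_t m)
{ return (hash + offset) & (m - 1); }

// Probe sequences for hash value 13 in a table of size 8.
static_assert(probe_position(13, linear_offset(0), 8) == 5, "");
static_assert(probe_position(13, linear_offset(1), 8) == 6, "");
static_assert(probe_position(13, linear_offset(3), 8) == 0, "");
static_assert(probe_position(13, quadratic_offset(2), 8) == 1, "");
static_assert(probe_position(13, quadratic_offset(3), 8) == 6, "");
```

<para>The first probe (offset 0) coincides with the position a
collision-chaining table would use for the same hash value.</para>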
|
||
|
||
</section>
|
||
|
||
<section xml:id="hash_policies.implementation.predefined">
|
||
<info><title>
|
||
Pre-Defined Policies
|
||
</title></info>
|
||
|
||
<para>This library contains some pre-defined classes
|
||
implementing range-hashing and probing functions:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para><classname>direct_mask_range_hashing</classname>
|
||
and <classname>direct_mod_range_hashing</classname>
|
||
are range-hashing functions based on a bit-mask and a modulo
|
||
operation, respectively.</para></listitem>
|
||
|
||
<listitem><para><classname>linear_probe_fn</classname> and
|
||
<classname>quadratic_probe_fn</classname> are
|
||
a linear probe and a quadratic probe function,
|
||
respectively.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>
|
||
The graphic below shows the relationships.
|
||
</para>
|
||
<figure>
|
||
<title>Hash policy class diagram</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_hash_policy_cd.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Hash policy class diagram</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
</section>
|
||
|
||
</section> <!-- impl -->
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.hash.details.resize_policies">
|
||
<info><title>Resize Policies</title></info>
|
||
|
||
<section xml:id="resize_policies.general">
|
||
<info><title>General</title></info>
|
||
|
||
<para>Hash-tables, as opposed to trees, do not naturally grow or
|
||
shrink. It is necessary to specify policies to determine how
|
||
and when a hash table should change its size. Usually, resize
|
||
policies can be decomposed into orthogonal policies:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>A size policy indicating how a hash table
|
||
should grow (e.g., it should multiply by powers of
|
||
2).</para></listitem>
|
||
|
||
<listitem><para>A trigger policy indicating when a hash
|
||
table should grow (e.g., a load factor is
|
||
exceeded).</para></listitem>
|
||
</orderedlist>
|
||
|
||
</section>
|
||
|
||
<section xml:id="resize_policies.size">
|
||
<info><title>Size Policies</title></info>
|
||
|
||
|
||
<para>Size policies determine how a hash table changes size. These
|
||
policies are simple, and there are relatively few sensible
|
||
options. An exponential-size policy (with the initial size and
|
||
growth factors both powers of 2) works well with a mask-based
|
||
range-hashing function, and is the
|
||
hard-wired policy used by Dinkumware. A
|
||
prime-list based policy works well with a modulo-prime range
|
||
hashing function and is the hard-wired policy used by SGI's
|
||
implementation.</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="resize_policies.trigger">
|
||
<info><title>Trigger Policies</title></info>
|
||
|
||
<para>Trigger policies determine when a hash table changes size.
|
||
Following is a description of two policies: load-check
|
||
policies, and collision-check policies.</para>
|
||
|
||
<para>Load-check policies are straightforward. The user specifies
|
||
two factors, α<subscript>min</subscript> and
α<subscript>max</subscript>, and the hash table maintains the
|
||
invariant that</para>
|
||
|
||
<para>α<subscript>min</subscript> ≤ (number of
stored elements) / (hash-table size) ≤
α<subscript>max</subscript><remark>load factor min max</remark></para>
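<para>The load-check invariant can be sketched as a small predicate
object. The struct and its member names are hypothetical, and the
factors 0.125 and 0.5 used in the checks are illustrative values
only.</para>

```cpp
#include <cstddef>

// Load-check trigger: keeps n / m between alpha_min and alpha_max,
// where n is the number of stored elements and m the table size.
struct load_check
{
  double alpha_min;
  double alpha_max;

  // True if the invariant alpha_min <= n / m <= alpha_max holds.
  constexpr bool holds(std::size_t n, std::size_t m) const
  { return alpha_min * m <= n && n <= alpha_max * m; }

  // True if the table is too full and should grow.
  constexpr bool must_grow(std::size_t n, std::size_t m) const
  { return n > alpha_max * m; }

  // True if the table is too sparse and should shrink.
  constexpr bool must_shrink(std::size_t n, std::size_t m) const
  { return n < alpha_min * m; }
};

// Four elements in eight slots satisfy 0.125 <= 0.5 <= 0.5.
static_assert(load_check{0.125, 0.5}.holds(4, 8), "");
// A fifth element pushes the load factor above the maximum.
static_assert(load_check{0.125, 0.5}.must_grow(5, 8), "");
```

<para>The comparisons multiply by m instead of dividing by it, which
avoids a division and stays well defined for an empty table.</para>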
|
||
|
||
<para>Collision-check policies work in the opposite direction of
|
||
load-check policies. They focus on keeping the number of
|
||
collisions moderate and hoping that the size of the table will
|
||
not grow very large, instead of keeping a moderate load-factor
|
||
and hoping that the number of collisions will be small. A
|
||
maximal collision-check policy resizes when the longest
|
||
probe-sequence grows too large.</para>
|
||
|
||
<para>Consider the graphic below. Let the size of the hash table
be denoted by m, the length of a probe sequence be denoted by k,
and some load factor be denoted by α. We would like to
calculate the minimal k such that, if there were α m
elements in the hash table, a probe sequence of length k would
be found with probability at most 1/m.</para>
|
||
|
||
<figure>
|
||
<title>Balls and bins</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_balls_and_bins.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Balls and bins</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>Denote the probability that a probe sequence of length
|
||
k appears in bin i by p<subscript>i</subscript>, the
|
||
length of the probe sequence of bin i by
|
||
l<subscript>i</subscript>, and assume uniform distribution. Then</para>
|
||
|
||
|
||
|
||
<equation>
|
||
<title>
|
||
Probability of Probe Sequence of Length k
|
||
</title>
|
||
<mathphrase>
|
||
p<subscript>1</subscript> =
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
<para>P(l<subscript>1</subscript> ≥ k) =</para>
|
||
|
||
<para>
|
||
P( l<subscript>1</subscript> ≥ α ( 1 + ( k / α - 1 ) ) ) ≤ (a)
|
||
</para>
|
||
|
||
<para>
|
||
e ^ ( - ( α ( k / α - 1 )<superscript>2</superscript> ) /2)
|
||
</para>
|
||
|
||
<para>where (a) follows from the Chernoff bound (<xref linkend="biblio.motwani95random"/>). To
calculate the probability that some bin contains a probe
sequence of length at least k, we note that the
l<subscript>i</subscript> are negatively-dependent
(<xref linkend="biblio.dubhashi98neg"/>). Let
I(.) denote the indicator function. Then</para>
|
||
|
||
<equation>
|
||
<title>
|
||
Probability Probe Sequence in Some Bin
|
||
</title>
|
||
<mathphrase>
|
||
P( exists<subscript>i</subscript> l<subscript>i</subscript> ≥ k ) =
|
||
</mathphrase>
|
||
</equation>
|
||
|
||
<para>P ( ∑ <subscript>i = 1</subscript><superscript>m</superscript>
|
||
I(l<subscript>i</subscript> ≥ k) ≥ 1 ) =</para>
|
||
|
||
<para>P ( ∑ <subscript>i = 1</subscript><superscript>m</superscript> I (
|
||
l<subscript>i</subscript> ≥ k ) ≥ m p<subscript>1</subscript> ( 1 + 1 / (m
|
||
p<subscript>1</subscript>) - 1 ) ) ≤ (a)</para>
|
||
|
||
<para>e ^ ( ( - m p<subscript>1</subscript> ( 1 / (m p<subscript>1</subscript>)
|
||
- 1 ) <superscript>2</superscript> ) / 2 ) ,</para>
|
||
|
||
<para>where (a) follows from the fact that the Chernoff bound can
|
||
be applied to negatively-dependent variables (<xref
|
||
linkend="biblio.dubhashi98neg"/>). Inserting the first probability
|
||
equation into the second one, and equating with 1/m, we
|
||
obtain</para>
|
||
|
||
|
||
<para>k ~ √ ( 2 α ln ( 2 m ln(m) ) ) .</para>
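<para>To give a feel for the magnitudes involved, the bound can be
evaluated numerically. The sketch below reads the result as k being
approximately the square root of 2 α ln(2 m ln m), which is an
interpretation of the derivation above; the function name is
hypothetical and m is assumed to be at least 2.</para>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Probe-sequence length at which a maximal collision-check policy
// would resize, for load factor alpha and table size m (m >= 2).
// Assumes k ~ sqrt(2 * alpha * ln(2 * m * ln(m))), rounded up.
inline std::size_t max_probe_length(double alpha, std::size_t m)
{
  const double size = static_cast<double>(m);
  const double inner = 2.0 * size * std::log(size);
  return static_cast<std::size_t>(
      std::ceil(std::sqrt(2.0 * alpha * std::log(inner))));
}
```

<para>For α = 0.5 and m = 1024 the square root is roughly 3.09, so the
threshold rounds up to 4; the value grows only very slowly with
m.</para>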
|
||
|
||
</section>
|
||
|
||
<section xml:id="resize_policies.impl">
|
||
<info><title>Implementation</title></info>
|
||
|
||
<para>This sub-subsection describes the implementation of the
|
||
above in this library. It first describes resize policies and
|
||
their decomposition into trigger and size policies, then
|
||
describes pre-defined classes, and finally discusses controlled
|
||
access to the policies' internals.</para>
|
||
|
||
<section xml:id="resize_policies.impl.decomposition">
|
||
<info><title>Decomposition</title></info>
|
||
|
||
|
||
<para>Each hash-based container is parametrized by a
|
||
<classname>Resize_Policy</classname> parameter; the container derives
|
||
<classname>public</classname>ly from <classname>Resize_Policy</classname>. For
|
||
example:</para>
|
||
<programlisting>
template<typename Key,
         typename Mapped,
         ...
         typename Resize_Policy,
         ...>
class cc_hash_table : public Resize_Policy
</programlisting>
|
||
|
||
<para>As a container object is modified, it continuously notifies
|
||
its <classname>Resize_Policy</classname> base of internal changes
|
||
(e.g., collisions encountered and elements being
|
||
inserted). It queries its <classname>Resize_Policy</classname> base whether
|
||
it needs to be resized, and if so, to what size.</para>
|
||
|
||
<para>The graphic below shows a (possible) sequence diagram
|
||
of an insert operation. The user inserts an element; the hash
|
||
table notifies its resize policy that a search has started
|
||
(point A); in this case, a single collision is encountered -
|
||
the table notifies its resize policy of this (point B); the
|
||
container finally notifies its resize policy that the search
|
||
has ended (point C); it then queries its resize policy whether
|
||
a resize is needed, and if so, what is the new size (points D
|
||
to G); following the resize, it notifies the policy that a
|
||
resize has completed (point H); finally, the element is
|
||
inserted, and the policy notified (point I).</para>
|
||
|
||
<figure>
|
||
<title>Insert resize sequence diagram</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_insert_resize_sequence_diagram1.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Insert resize sequence diagram</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
<para>In practice, a resize policy can usually be decomposed
orthogonally into a size policy and a trigger policy. Consequently,
|
||
the library contains a single class for instantiating a resize
|
||
policy: <classname>hash_standard_resize_policy</classname>
|
||
is parametrized by <classname>Size_Policy</classname> and
|
||
<classname>Trigger_Policy</classname>, derives <classname>public</classname>ly from
|
||
both, and acts as a standard delegate (<xref linkend="biblio.gof"/>)
|
||
to these policies.</para>
|
||
|
||
<para>The two graphics immediately below show sequence diagrams
|
||
illustrating the interaction between the standard resize policy
|
||
and its trigger and size policies, respectively.</para>
|
||
|
||
<figure>
|
||
<title>Standard resize policy trigger sequence
|
||
diagram</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_insert_resize_sequence_diagram2.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Standard resize policy trigger sequence
|
||
diagram</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<figure>
|
||
<title>Standard resize policy size sequence
|
||
diagram</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_insert_resize_sequence_diagram3.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Standard resize policy size sequence
|
||
diagram</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="resize_policies.impl.predefined">
|
||
<info><title>Predefined Policies</title></info>
|
||
<para>The library includes the following
|
||
instantiations of size and trigger policies:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para><classname>hash_load_check_resize_trigger</classname>
|
||
implements a load check trigger policy.</para></listitem>
|
||
|
||
<listitem><para><classname>cc_hash_max_collision_check_resize_trigger</classname>
|
||
implements a collision check trigger policy.</para></listitem>
|
||
|
||
<listitem><para><classname>hash_exponential_size_policy</classname>
|
||
implements an exponential-size policy (which should be used
|
||
with mask range hashing).</para></listitem>
|
||
|
||
<listitem><para><classname>hash_prime_size_policy</classname>
implements a size policy based on a sequence of primes
(which should be used with mod range hashing).</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The graphic below gives an overall picture of the resize-related
|
||
classes. <classname>basic_hash_table</classname>
|
||
is parametrized by <classname>Resize_Policy</classname>, which it subclasses
|
||
publicly. This class is currently instantiated only by <classname>hash_standard_resize_policy</classname>.
|
||
<classname>hash_standard_resize_policy</classname>
|
||
itself is parametrized by <classname>Trigger_Policy</classname> and
|
||
<classname>Size_Policy</classname>. Currently, <classname>Trigger_Policy</classname> is
|
||
instantiated by <classname>hash_load_check_resize_trigger</classname>,
|
||
or <classname>cc_hash_max_collision_check_resize_trigger</classname>;
|
||
<classname>Size_Policy</classname> is instantiated by <classname>hash_exponential_size_policy</classname>,
|
||
or <classname>hash_prime_size_policy</classname>.</para>
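<para>As a minimal sketch of how these pieces compose, the listing
below builds two resize policies and plugs each into a
<classname>cc_hash_table</classname>. It assumes the headers and the
<literal>__gnu_pbds</literal> namespace of a recent libstdc++;
<classname>int_hash</classname>, the typedef names, and
<function>fill_and_count</function> are illustrative and not part of
the library. Both tables use the load-check trigger; only the size
policy and the matching range-hashing function differ.</para>
<programlisting>
```cpp
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Illustrative hash functor (an assumption of this sketch).
struct int_hash
{
  std::size_t
  operator()(int i) const
  { return static_cast<std::size_t>(i); }
};

// Exponential sizes, resized when the load leaves its bounds.
typedef __gnu_pbds::hash_standard_resize_policy<
  __gnu_pbds::hash_exponential_size_policy<>,
  __gnu_pbds::hash_load_check_resize_trigger<> >
  exp_resize_policy;

// Exponential sizes are paired with mask range hashing.
typedef __gnu_pbds::cc_hash_table<
  int, char, int_hash, std::equal_to<int>,
  __gnu_pbds::direct_mask_range_hashing<>,
  exp_resize_policy>
  exp_table;

// Prime sizes, with the same trigger policy.
typedef __gnu_pbds::hash_standard_resize_policy<
  __gnu_pbds::hash_prime_size_policy,
  __gnu_pbds::hash_load_check_resize_trigger<> >
  prime_resize_policy;

// Prime sizes are paired with mod range hashing.
typedef __gnu_pbds::cc_hash_table<
  int, char, int_hash, std::equal_to<int>,
  __gnu_pbds::direct_mod_range_hashing<>,
  prime_resize_policy>
  prime_table;

// Inserts n keys into a fresh table and reports how many it holds.
template<typename Table>
std::size_t
fill_and_count(int n)
{
  Table t;
  for (int i = 0; i < n; ++i)
    t[i] = static_cast<char>('a' + i % 26);
  return t.size();
}
```
</programlisting>
<para>Calling <literal>fill_and_count&lt;exp_table&gt;(1000)</literal>
exercises the resize path repeatedly, since the table has to grow as
the keys are inserted.</para>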
|
||
|
||
</section>
|
||
|
||
<section xml:id="resize_policies.impl.internals">
|
||
<info><title>Controlling Access to Internals</title></info>
|
||
|
||
<para>There are cases where (controlled) access to resize
|
||
policies' internals is beneficial. E.g., it is sometimes
|
||
useful to query a hash-table for the table's actual size (as
|
||
opposed to its <function>size()</function> - the number of values it
|
||
currently holds); it is sometimes useful to set a table's
|
||
initial size, externally resize it, or change load factors.</para>
|
||
|
||
<para>Clearly, supporting such methods both decreases the
|
||
encapsulation of hash-based containers, and increases the
|
||
diversity between different associative-containers' interfaces.
|
||
Conversely, omitting such methods can decrease containers'
|
||
flexibility.</para>
|
||
|
||
<para>In order to avoid, to the extent possible, the above
|
||
conflict, the hash-based containers themselves do not address
|
||
any of these questions; this is deferred to the resize policies,
|
||
which are easier to change or replace. Thus, for example,
|
||
neither <classname>cc_hash_table</classname> nor
|
||
<classname>gp_hash_table</classname>
|
||
contains methods for querying the actual size of the table; this
|
||
is deferred to <classname>hash_standard_resize_policy</classname>.</para>
|
||
|
||
<para>Furthermore, the policies themselves are parametrized by
|
||
template arguments that determine the methods they support
|
||
(
|
||
<xref linkend="biblio.alexandrescu01modern"/>
|
||
shows techniques for doing so). <classname>hash_standard_resize_policy</classname>
|
||
is parametrized by <classname>External_Size_Access</classname> that
|
||
determines whether it supports methods for querying the actual
|
||
size of the table or resizing it. <classname>hash_load_check_resize_trigger</classname>
|
||
is parametrized by <classname>External_Load_Access</classname> that
|
||
determines whether it supports methods for querying or
|
||
modifying the loads. <classname>cc_hash_max_collision_check_resize_trigger</classname>
|
||
is parametrized by <classname>External_Load_Access</classname> that
|
||
determines whether it supports methods for querying the
|
||
load.</para>
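<para>The following sketch shows what enabling these access
parameters might look like in practice. It assumes a recent libstdc++
and sets both <classname>External_Size_Access</classname> and
<classname>External_Load_Access</classname> to <literal>true</literal>;
<classname>int_hash</classname>, <classname>open_table</classname>, and
the three helper functions are illustrative names, not library
names.</para>
<programlisting>
```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Illustrative hash functor (an assumption of this sketch).
struct int_hash
{
  std::size_t
  operator()(int i) const
  { return static_cast<std::size_t>(i); }
};

// The trailing "true" arguments request external load access on the
// trigger policy and external size access on the resize policy.
typedef __gnu_pbds::cc_hash_table<
  int, char, int_hash, std::equal_to<int>,
  __gnu_pbds::direct_mask_range_hashing<>,
  __gnu_pbds::hash_standard_resize_policy<
    __gnu_pbds::hash_exponential_size_policy<>,
    __gnu_pbds::hash_load_check_resize_trigger<true>,
    true> >
  open_table;

// Fills t with n entries.
inline void
fill_table(open_table& t, int n)
{
  for (int i = 0; i < n; ++i)
    t[i] = static_cast<char>('a' + i % 26);
}

// Resizes the table externally; returns true if the actual size grew.
inline bool
grow_externally(open_table& t)
{
  const std::size_t before = t.get_actual_size();
  t.resize(before * 2 - 1);
  return t.get_actual_size() > before;
}

// Sets new loads and returns the loads the trigger reports afterwards.
inline std::pair<float, float>
change_loads(open_table& t, float load_min, float load_max)
{
  t.set_loads(std::make_pair(load_min, load_max));
  return t.get_loads();
}
```
</programlisting>
<para>With either access parameter left at <literal>false</literal>,
the corresponding calls are expected to be rejected at compile time,
which is how the policies keep this access controlled.</para>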
|
||
|
||
<para>Some operations, for example, resizing a container at
|
||
run time, or changing the load factors of a load-check trigger
|
||
policy, require the container itself to resize. As mentioned
|
||
above, the hash-based containers themselves do not contain
|
||
these types of methods, only their resize policies.
|
||
Consequently, there must be some mechanism for a resize policy
|
||
to manipulate the hash-based container. As the hash-based
|
||
container is a subclass of the resize policy, this is done
|
||
through virtual methods. Each hash-based container has a
|
||
<classname>private</classname> <classname>virtual</classname> method:</para>
|
||
<programlisting>
|
||
virtual void
|
||
do_resize
|
||
(size_type new_size);
|
||
</programlisting>
|
||
|
||
<para>which resizes the container. Implementations of
|
||
<classname>Resize_Policy</classname> can export public methods for resizing
|
||
the container externally; these methods internally call
|
||
<classname>do_resize</classname> to resize the table.</para>
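<para>The mechanism can be illustrated in isolation with a toy
example. The classes below are hypothetical and far simpler than the
library's; they only show how a policy base class can export a public
<function>resize</function> that reaches the derived container through
a private virtual method.</para>
<programlisting>
```cpp
#include <cstddef>
#include <vector>

// Hypothetical resize policy: exports a public resize method that
// forwards to a private virtual hook.
class toy_resize_policy
{
public:
  // Public entry point for resizing externally.
  void
  resize(std::size_t new_size)
  { do_resize(new_size); }

protected:
  virtual
  ~toy_resize_policy() { }

private:
  // Implemented by the container that derives from this policy.
  virtual void
  do_resize(std::size_t new_size) = 0;
};

// Hypothetical container deriving publicly from its resize policy.
template<typename Resize_Policy>
class toy_table : public Resize_Policy
{
public:
  toy_table() : m_buckets(8) { }

  std::size_t
  actual_size() const
  { return m_buckets.size(); }

private:
  // Only the container knows how to resize its own storage.
  virtual void
  do_resize(std::size_t new_size)
  { m_buckets.resize(new_size); }

  std::vector<int> m_buckets;
};
```
</programlisting>
<para>Because the container inherits publicly from the policy,
<literal>toy_table&lt;toy_resize_policy&gt;</literal> exposes
<function>resize</function> as if it were its own method, while
<function>do_resize</function> stays private.</para>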
|
||
|
||
|
||
</section>
|
||
|
||
</section>
|
||
|
||
|
||
</section> <!-- resize policies -->
|
||
|
||
<section xml:id="container.hash.details.policy_interaction">
|
||
<info><title>Policy Interactions</title></info>
|
||
|
||
<para>Hash tables are, unfortunately, especially sensitive to the
choice of policies. One of the more complicated aspects of this
is that poor combinations of good policies can form a poor
container. Following are some considerations.</para>
|
||
|
||
<section xml:id="policy_interaction.probesizetrigger">
|
||
<info><title>probe/size/trigger</title></info>
|
||
|
||
<para>Some combinations do not work well for probing containers.
|
||
For example, combining a quadratic probe policy with an
|
||
exponential size policy can yield a poor container: when an
|
||
element is inserted, a trigger policy might decide that there
|
||
is no need to resize, as the table still contains unused
|
||
entries; the probe sequence, however, might never reach any of
|
||
the unused entries.</para>
|
||
|
||
<para>Unfortunately, this library cannot detect such problems at
compile time (doing so would amount to solving the halting
problem). It therefore defines an exception class,
<classname>insert_error</classname>, which is thrown in this
case.</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="policy_interaction.hashtrigger">
|
||
<info><title>hash/trigger</title></info>
|
||
|
||
<para>Some trigger policies are especially susceptible to poor
|
||
hash functions. Suppose, as an extreme case, that the hash
|
||
function transforms each key to the same hash value. After some
|
||
inserts, a collision detecting policy will always indicate that
|
||
the container needs to grow.</para>
|
||
|
||
<para>The library, therefore, by design, limits each operation to
|
||
one resize. For each <classname>insert</classname>, for example, it queries
|
||
only once whether a resize is needed.</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="policy_interaction.eqstorehash">
|
||
<info><title>equivalence functors/storing hash values/hash</title></info>
|
||
|
||
<para><classname>cc_hash_table</classname> and
|
||
<classname>gp_hash_table</classname> are
|
||
parametrized by an equivalence functor and by a
|
||
<classname>Store_Hash</classname> parameter. If the latter parameter is
|
||
<classname>true</classname>, then the container stores a hash value
with each entry, and in case of a collision compares the stored
values first to determine whether to apply the equivalence
functor. This can lower the cost of collisions for some types,
but increase it for other types.</para>
|
||
|
||
<para>If a ranged-hash function or ranged probe function is
|
||
directly supplied, however, then it makes no sense to store the
|
||
hash value with each entry. This library's container will
|
||
fail at compilation, by design, if this is attempted.</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="policy_interaction.sizeloadtrigger">
|
||
<info><title>size/load-check trigger</title></info>
|
||
|
||
<para>Assume a size policy issues an increasing sequence of sizes
a, a q, a q<superscript>2</superscript>, a q<superscript>3</superscript>, ... For
example, an exponential size policy might issue the sequence of
sizes 8, 16, 32, 64, ...</para>
|
||
|
||
<para>If a load-check trigger policy is used, with loads
|
||
α<subscript>min</subscript> and α<subscript>max</subscript>,
|
||
respectively, then it is a good idea to have:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>α<subscript>max</subscript> ~ 1 / q</para></listitem>
|
||
|
||
<listitem><para>α<subscript>min</subscript> < 1 / (2 q)</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>This will ensure that the amortized hash cost of each
|
||
modifying operation is at most approximately 3.</para>
|
||
|
||
<para>α<subscript>min</subscript> ~ α<subscript>max</subscript> is, in
any case, a bad choice, and α<subscript>min</subscript> >
α<subscript>max</subscript> is horrendous.</para>
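<para>As a small worked example of the guideline above, the function
below derives a pair of loads from a growth factor q. The name
<function>suggested_loads</function> is hypothetical, and the choice
of 1 / (4 q) for the minimal load is an assumption of this sketch; any
value below 1 / (2 q) satisfies the guideline.</para>
<programlisting>
```cpp
#include <utility>

// Returns (alpha_min, alpha_max) for a size policy growing by factor q.
// alpha_max is taken as 1 / q; alpha_min as 1 / (4 q), which is one
// arbitrary choice below the 1 / (2 q) bound.
inline std::pair<float, float>
suggested_loads(float q)
{
  const float alpha_max = 1.0f / q;
  const float alpha_min = 1.0f / (4.0f * q);
  return std::make_pair(alpha_min, alpha_max);
}
```
</programlisting>
<para>For an exponential size policy with q = 2, this yields the loads
0.125 and 0.5.</para>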
|
||
|
||
</section>
|
||
|
||
</section>
|
||
|
||
</section> <!-- details -->
|
||
|
||
</section> <!-- hash -->
|
||
|
||
<!-- tree -->
|
||
<section xml:id="pbds.design.container.tree">
|
||
<info><title>tree</title></info>
|
||
|
||
<section xml:id="container.tree.interface">
|
||
<info><title>Interface</title></info>
|
||
|
||
<para>The tree-based container has the following declaration:</para>
|
||
<programlisting>
|
||
template<
|
||
typename Key,
|
||
typename Mapped,
|
||
typename Cmp_Fn = std::less<Key>,
|
||
typename Tag = rb_tree_tag,
|
||
template<
|
||
typename Const_Node_Iterator,
|
||
typename Node_Iterator,
|
||
typename Cmp_Fn_,
|
||
typename Allocator_>
|
||
class Node_Update = null_node_update,
|
||
typename Allocator = std::allocator<char> >
|
||
class tree;
|
||
</programlisting>
|
||
|
||
<para>The parameters have the following meaning:</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para><classname>Key</classname> is the key type.</para></listitem>
|
||
|
||
<listitem>
|
||
<para><classname>Mapped</classname> is the mapped-policy.</para></listitem>
|
||
|
||
<listitem>
|
||
<para><classname>Cmp_Fn</classname> is a key comparison functor.</para></listitem>
|
||
|
||
<listitem>
|
||
<para><classname>Tag</classname> specifies which underlying data structure
|
||
to use.</para></listitem>
|
||
|
||
<listitem>
|
||
<para><classname>Node_Update</classname> is a policy for updating node
|
||
invariants.</para></listitem>
|
||
|
||
<listitem>
|
||
<para><classname>Allocator</classname> is an allocator
|
||
type.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The <classname>Tag</classname> parameter specifies which underlying
|
||
data structure to use. Instantiating it by <classname>rb_tree_tag</classname>, <classname>splay_tree_tag</classname>, or
|
||
<classname>ov_tree_tag</classname>,
|
||
specifies an underlying red-black tree, splay tree, or
|
||
ordered-vector tree, respectively; any other tag is illegal.
|
||
Note that containers based on the former two contain more types
|
||
and methods than the latter (e.g.,
|
||
<classname>reverse_iterator</classname> and <classname>rbegin</classname>), and different
|
||
exception and invalidation guarantees.</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.tree.details">
|
||
<info><title>Details</title></info>
|
||
|
||
<section xml:id="container.tree.node">
|
||
<info><title>Node Invariants</title></info>
|
||
|
||
|
||
<para>Consider the two trees in the graphic below, labels A and B. The first
|
||
is a tree of floats; the second is a tree of pairs, each
|
||
signifying a geometric line interval. Each element in a tree is referred to as a node of the tree. Of course, each of
|
||
these trees can support the usual queries: the first can easily
|
||
search for <classname>0.4</classname>; the second can easily search for
|
||
<classname>std::make_pair(10, 41)</classname>.</para>
|
||
|
||
<para>Each of these trees can efficiently support other queries.
|
||
The first can efficiently determine that the 2nd key in the
|
||
tree is <constant>0.3</constant>; the second can efficiently determine
|
||
whether any of its intervals overlaps
|
||
<programlisting>std::make_pair(29,42)</programlisting> (useful in geometric
|
||
applications or distributed file systems with leases, for
|
||
example). It should be noted that an <classname>std::set</classname> can
|
||
only solve these types of problems with linear complexity.</para>
|
||
|
||
<para>In order to do so, each tree stores some metadata in
|
||
each node, and maintains node invariants (see <xref linkend="biblio.clrs2001"/>). The first stores in
|
||
each node the size of the sub-tree rooted at the node; the
|
||
second stores at each node the maximal endpoint of the
|
||
intervals at the sub-tree rooted at the node.</para>
|
||
|
||
<figure>
|
||
<title>Tree node invariants</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_tree_node_invariants.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Tree node invariants</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>Supporting such trees is difficult for a number of
|
||
reasons:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>There must be a way to specify what a node's metadata
|
||
should be (if any).</para></listitem>
|
||
|
||
<listitem><para>Various operations can invalidate node
|
||
invariants. The graphic below shows how a right rotation,
|
||
performed on A, results in B, with nodes x and y having
|
||
corrupted invariants (the grayed nodes in C). The graphic also shows
|
||
how an insert, performed on D, results in E, with nodes x and y
|
||
having corrupted invariants (the grayed nodes in F). It is not
|
||
feasible to know outside the tree the effect of an operation on
|
||
the nodes of the tree.</para></listitem>
|
||
|
||
<listitem><para>The search paths of standard associative containers are
|
||
defined by comparisons between keys, and not through
|
||
metadata.</para></listitem>
|
||
|
||
<listitem><para>It is not feasible to know in advance which methods trees
|
||
can support. Besides the usual <classname>find</classname> method, the
|
||
first tree can support a <classname>find_by_order</classname> method, while
|
||
the second can support an <classname>overlaps</classname> method.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<figure>
|
||
<title>Tree node invalidation</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_tree_node_invalidations.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Tree node invalidation</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>These problems are solved by a combination of two means:
|
||
node iterators, and template-template node updater
|
||
parameters.</para>
|
||
|
||
<section xml:id="container.tree.node.iterators">
|
||
<info><title>Node Iterators</title></info>
|
||
|
||
|
||
<para>Each tree-based container defines two additional iterator
|
||
types, <classname>const_node_iterator</classname>
|
||
and <classname>node_iterator</classname>.
|
||
These iterators allow descending from a node to one of its
|
||
children. Node iterators allow search paths different from those
|
||
determined by the comparison functor. The <classname>tree</classname>
|
||
supports the methods:</para>
|
||
<programlisting>
|
||
const_node_iterator
|
||
node_begin() const;
|
||
|
||
node_iterator
|
||
node_begin();
|
||
|
||
const_node_iterator
|
||
node_end() const;
|
||
|
||
node_iterator
|
||
node_end();
|
||
</programlisting>
|
||
|
||
<para>The first pair returns node iterators corresponding to the
|
||
root node of the tree; the latter pair returns node iterators
|
||
corresponding to a just-after-leaf node.</para>
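<para>A minimal sketch of descending a tree through its node
iterators follows. It assumes a recent libstdc++, where
<classname>null_type</classname> marks a set and node iterators
provide <function>get_l_child</function> and
<function>get_r_child</function>; the typedef and function names are
illustrative. The node iterator type is deduced as a template
parameter so that the sketch does not depend on its exact
spelling.</para>
<programlisting>
```cpp
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A set of ints with the default (red-black tree) tag.
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type> int_set;

// Counts the nodes of the sub-tree rooted at nd by visiting children.
template<typename Node_It>
std::size_t
count_nodes(Node_It nd, Node_It end)
{
  if (nd == end)
    return 0;
  return 1 + count_nodes(nd.get_l_child(), end)
           + count_nodes(nd.get_r_child(), end);
}

// Searches for key by explicit descent from the root. Dereferencing
// a node iterator gives an iterator into the container.
template<typename Node_It>
bool
descend_to(Node_It nd, Node_It end, int key)
{
  while (nd != end)
    {
      const int node_key = **nd;
      if (key == node_key)
        return true;
      nd = key < node_key ? nd.get_l_child() : nd.get_r_child();
    }
  return false;
}

// Convenience wrapper starting at the root of a const tree.
inline std::size_t
count_via_node_iterators(const int_set& s)
{ return count_nodes(s.node_begin(), s.node_end()); }
```
</programlisting>
<para>Here the descent follows key comparisons, but nothing prevents a
different rule, such as one based on node metadata, from choosing the
child to visit.</para>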
|
||
</section>
|
||
|
||
<section xml:id="container.tree.node.updator">
|
||
<info><title>Node Updater</title></info>
|
||
|
||
<para>The tree-based containers are parametrized by a
|
||
<classname>Node_Update</classname> template-template parameter. A
|
||
tree-based container instantiates
|
||
<classname>Node_Update</classname> to some
|
||
<classname>node_update</classname> class, and publicly subclasses
|
||
<classname>node_update</classname>. The graphic below shows this
|
||
scheme, as well as some predefined policies (which are explained
|
||
below).</para>
|
||
|
||
<figure>
|
||
<title>A tree and its update policy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_tree_node_updator_policy_cd.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>A tree and its update policy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para><classname>node_update</classname> (an instantiation of
|
||
<classname>Node_Update</classname>) must define <classname>metadata_type</classname> as
|
||
the type of metadata it requires. For order statistics,
|
||
e.g., <classname>metadata_type</classname> might be <classname>size_t</classname>.
|
||
The tree defines within each node a <classname>metadata_type</classname>
|
||
object.</para>
|
||
|
||
<para><classname>node_update</classname> must also define the following method
|
||
for restoring node invariants:</para>
|
||
<programlisting>
|
||
void
|
||
operator()(node_iterator nd_it, const_node_iterator end_nd_it)
|
||
</programlisting>
|
||
|
||
<para>In this method, <varname>nd_it</varname> is a
|
||
<classname>node_iterator</classname> corresponding to a node whose
|
||
A) all descendants have valid invariants, and B) its own
|
||
invariants might be violated; <classname>end_nd_it</classname> is
|
||
a <classname>const_node_iterator</classname> corresponding to a
|
||
just-after-leaf node. This method should correct the node
|
||
invariants of the node pointed to by
|
||
<classname>nd_it</classname>. For example, say node x in the
graphic below, label A, has an invalid invariant, but its children,
y and z, have valid invariants. After the invocation, all three
|
||
nodes should have valid invariants, as in label B.</para>
|
||
|
||
|
||
<figure>
|
||
<title>Restoring node invariants</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_restoring_node_invariants.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Restoring node invariants</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
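<para>The idea of restoring an invariant from valid children can be
sketched independently of the library. The toy node type below is
hypothetical; it stores a sub-tree size as its metadata, and
<function>update_size</function> plays the role of the update method
for a single node. A right rotation then restores the invariants of
the two nodes it disturbs, lower node first.</para>
<programlisting>
```cpp
#include <cstddef>

// Hypothetical node carrying the size of its sub-tree as metadata.
struct toy_node
{
  int key;
  std::size_t subtree_size;
  toy_node* left;
  toy_node* right;
};

// Restores the invariant of nd, assuming that the metadata of both
// children is already valid.
inline void
update_size(toy_node* nd)
{
  const std::size_t l = nd->left ? nd->left->subtree_size : 0;
  const std::size_t r = nd->right ? nd->right->subtree_size : 0;
  nd->subtree_size = 1 + l + r;
}

// Right rotation around y; y's left child x becomes the new root.
// Only x and y have corrupted invariants afterwards.
inline toy_node*
rotate_right(toy_node* y)
{
  toy_node* x = y->left;
  y->left = x->right;
  x->right = y;
  update_size(y); // y is now the child, so it is fixed first.
  update_size(x);
  return x;
}
```
</programlisting>
<para>The order of the two updates matters: each call relies on the
node's children already being valid, which is the same precondition
stated for the update method above.</para>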
|
||
|
||
<para>When a tree operation might invalidate some node invariant,
|
||
it invokes this method in its <classname>node_update</classname> base to
|
||
restore the invariant. For example, the graphic below shows
|
||
an <function>insert</function> operation (point A); the tree performs some
|
||
operations, and calls the update functor three times (points B,
|
||
C, and D). (It is well known that any <function>insert</function>,
<function>erase</function>, <function>split</function>, or <function>join</function> can restore
all node invariants by a small number of node invariant updates; see <xref linkend="biblio.clrs2001"/>.)</para>
|
||
|
||
<figure>
|
||
<title>Insert update sequence</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_update_seq_diagram.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Insert update sequence</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>To complete the description of the scheme, three questions
|
||
need to be answered:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>How can a tree which supports order statistics define a
|
||
method such as <classname>find_by_order</classname>?</para></listitem>
|
||
|
||
<listitem><para>How can the node updater base access methods of the
|
||
tree?</para></listitem>
|
||
|
||
<listitem><para>How can the following cyclic dependency be resolved?
|
||
<classname>node_update</classname> is a base class of the tree, yet it
|
||
uses node iterators defined in the tree (its child).</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The first two questions are answered by the fact that
|
||
<classname>node_update</classname> (an instantiation of
|
||
<classname>Node_Update</classname>) is a <emphasis>public</emphasis> base class
|
||
of the tree. Consequently:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>Any public methods of
|
||
<classname>node_update</classname> are automatically methods of
|
||
the tree (<xref linkend="biblio.alexandrescu01modern"/>).
|
||
Thus, an order-statistics node updater,
<classname>tree_order_statistics_node_update</classname>, defines
|
||
the <function>find_by_order</function> method; any tree
|
||
instantiated by this policy consequently supports this method as
|
||
well.</para></listitem>
|
||
|
||
<listitem><para>In C++, if a base class declares a method as
|
||
<literal>virtual</literal>, it is
|
||
<literal>virtual</literal> in its subclasses. If
|
||
<classname>node_update</classname> needs to access one of the
|
||
tree's methods, say the member function
|
||
<function>end</function>, it simply declares that method as
pure <literal>virtual</literal>.</para></listitem>
|
||
</orderedlist>
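<para>A short sketch of the first point: a tree instantiated with
<classname>tree_order_statistics_node_update</classname> gains the
updater's public methods. It assumes a recent libstdc++, where
<classname>null_type</classname> marks a set; the typedef and the two
wrapper functions are illustrative names.</para>
<programlisting>
```cpp
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree set of ints supporting order statistics.
typedef __gnu_pbds::tree<
  int, __gnu_pbds::null_type, std::less<int>,
  __gnu_pbds::rb_tree_tag,
  __gnu_pbds::tree_order_statistics_node_update>
  order_statistics_set;

// Returns the key at zero-based position k; requires k < s.size().
inline int
kth_smallest(const order_statistics_set& s, std::size_t k)
{ return *s.find_by_order(k); }

// Returns the number of keys strictly smaller than key.
inline std::size_t
rank_of(const order_statistics_set& s, int key)
{ return s.order_of_key(key); }
```
</programlisting>
<para>Both calls are methods of the node updater, not of the tree
template itself; they become available only because the tree derives
publicly from its updater.</para>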
|
||
|
||
<para>The cyclic dependency is solved through template-template
|
||
parameters. <classname>Node_Update</classname> is parametrized by
|
||
the tree's node iterators, its comparison functor, and its
|
||
allocator type. Thus, instantiations of
|
||
<classname>Node_Update</classname> have all information
|
||
required.</para>
|
||
|
||
<para>This library assumes that constructing a metadata object and
|
||
modifying it are exception free. Suppose that during some method,
|
||
say <classname>insert</classname>, a metadata-related operation
|
||
(e.g., changing the value of a metadata) throws an exception. Ack!
|
||
Rolling back the method is unusually complex.</para>
|
||
|
||
<para>Previously, a distinction was made between redundant
|
||
policies and null policies. Node invariants show a
|
||
case where null policies are required.</para>
|
||
|
||
<para>Assume a regular tree is required, one which need not
|
||
support order statistics or interval overlap queries.
|
||
Seemingly, in this case a redundant policy (a policy which
doesn't affect nodes' contents) would suffice. This, however, would
lead to the following drawbacks:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>Each node would carry a useless metadata object, wasting
|
||
space.</para></listitem>
|
||
|
||
<listitem><para>The tree cannot know if its
|
||
<classname>Node_Update</classname> policy actually modifies a
|
||
node's metadata (deciding this would amount to solving the halting problem). In the graphic
|
||
below, assume the shaded node is inserted. The tree would have
|
||
to traverse the useless path shown to the root, applying
|
||
redundant updates all the way.</para></listitem>
|
||
</orderedlist>
|
||
<figure>
|
||
<title>Useless update path</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_rationale_null_node_updator.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Useless update path</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
<para>A null policy class, <classname>null_node_update</classname>,
|
||
solves both these problems. The tree detects that node
|
||
invariants are irrelevant, and defines all accordingly.</para>
|
||
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.tree.details.split">
|
||
<info><title>Split and Join</title></info>
|
||
|
||
<para>Tree-based containers support split and join methods.
|
||
It is possible to split a tree so that it passes
|
||
all nodes with keys larger than a given key to a different
|
||
tree. These methods have the following advantages over the
|
||
alternative of externally inserting to the destination
|
||
tree and erasing from the source tree:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>These methods are efficient - red-black trees are split
|
||
and joined in poly-logarithmic complexity; ordered-vector
|
||
trees are split and joined at linear complexity. The
|
||
alternatives have super-linear complexity.</para></listitem>
|
||
|
||
<listitem><para>Aside from orders of growth, these operations perform
|
||
few allocations and de-allocations. For red-black trees, allocations are not performed,
|
||
and the methods are exception-free. </para></listitem>
|
||
</orderedlist>
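<para>The following sketch shows a split followed by a join on a
red-black tree set. It assumes a recent libstdc++; the typedef and
helper names are illustrative. Note that a join requires the key
ranges of the two trees not to overlap, which holds here because the
second tree was produced by a split.</para>
<programlisting>
```cpp
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A set of ints with the default (red-black tree) tag.
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type> int_set;

// Moves all keys larger than pivot from src into dst and returns
// the number of keys that dst holds afterwards.
inline std::size_t
split_off_larger(int_set& src, int pivot, int_set& dst)
{
  src.split(pivot, dst);
  return dst.size();
}

// Moves all keys of src into dst and returns dst's new size.
inline std::size_t
join_back(int_set& dst, int_set& src)
{
  dst.join(src);
  return dst.size();
}
```
</programlisting>
<para>After the join, the source tree is left empty; no element is
copied or re-inserted one by one.</para>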
|
||
</section>
|
||
|
||
</section> <!-- details -->
|
||
|
||
</section> <!-- tree -->
|
||
|
||
<!-- trie -->
|
||
<section xml:id="pbds.design.container.trie">
|
||
<info><title>Trie</title></info>
|
||
|
||
<section xml:id="container.trie.interface">
|
||
<info><title>Interface</title></info>
|
||
|
||
<para>The trie-based container has the following declaration:</para>
|
||
<programlisting>
|
||
template<typename Key,
|
||
typename Mapped,
|
||
typename E_Access_Traits = /* default element-access traits for Key */,
|
||
typename Tag = pat_trie_tag,
|
||
template<typename Const_Node_Iterator,
|
||
typename Node_Iterator,
|
||
typename E_Access_Traits_,
|
||
typename Allocator_>
|
||
class Node_Update = null_node_update,
|
||
typename Allocator = std::allocator<char> >
|
||
class trie;
|
||
</programlisting>
|
||
|
||
<para>The parameters have the following meaning:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para><classname>Key</classname> is the key type.</para></listitem>
|
||
|
||
<listitem><para><classname>Mapped</classname> is the mapped-policy.</para></listitem>
|
||
|
||
<listitem><para><classname>E_Access_Traits</classname> is described below.</para></listitem>
|
||
|
||
<listitem><para><classname>Tag</classname> specifies which underlying data structure
|
||
to use, and is described shortly.</para></listitem>
|
||
|
||
<listitem><para><classname>Node_Update</classname> is a policy for updating node
|
||
invariants. This is described below.</para></listitem>
|
||
|
||
<listitem><para><classname>Allocator</classname> is an allocator
|
||
type.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The <classname>Tag</classname> parameter specifies which underlying
|
||
data structure to use. Instantiating it by <classname>pat_trie_tag</classname> specifies an
|
||
underlying PATRICIA trie (explained shortly); any other tag is
|
||
currently illegal.</para>
|
||
|
||
<para>Following is a description of a (PATRICIA) trie
|
||
(this implementation follows <xref linkend="biblio.okasaki98mereable"/> and
|
||
<xref linkend="biblio.filliatre2000ptset"/>).
|
||
</para>
|
||
|
||
<para>A (PATRICIA) trie is similar to a tree, but with the
|
||
following differences:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>It explicitly views keys as a sequence of elements.
|
||
E.g., a trie can view a string as a sequence of
|
||
characters; a trie can view a number as a sequence of
|
||
bits.</para></listitem>
|
||
|
||
<listitem><para>It is not (necessarily) binary. Each node has fan-out n
|
||
+ 1, where n is the number of distinct
|
||
elements.</para></listitem>
|
||
|
||
<listitem><para>It stores values only at leaf nodes.</para></listitem>
|
||
|
||
<listitem><para>Internal nodes have the properties that A) each has at
least two children, and B) each shares the same prefix with
all of its descendants.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>A (PATRICIA) trie has some useful properties:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>It can be configured to use a large node fan-out, giving it
very efficient find performance (albeit at the cost of insertion
complexity and size).</para></listitem>
|
||
|
||
<listitem><para>It works well for common-prefix keys.</para></listitem>
|
||
|
||
<listitem><para>It can efficiently support queries such as which
|
||
keys match a certain prefix. This is sometimes useful in file
|
||
systems and routers, and for "type-ahead" aka predictive text matching
|
||
on mobile devices.</para></listitem>
|
||
</orderedlist>
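<para>A prefix query can be sketched as follows. The listing assumes
a recent libstdc++, in which the string traits class is spelled
<classname>trie_string_access_traits</classname> and the node updater
<classname>trie_prefix_search_node_update</classname> supplies
<function>prefix_range</function>; the typedef and function names are
illustrative.</para>
<programlisting>
```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>

// A PATRICIA trie set of strings supporting prefix searches.
typedef __gnu_pbds::trie<
  std::string, __gnu_pbds::null_type,
  __gnu_pbds::trie_string_access_traits<>,
  __gnu_pbds::pat_trie_tag,
  __gnu_pbds::trie_prefix_search_node_update>
  string_trie;

// Counts the keys of t that start with the given prefix.
inline std::size_t
count_with_prefix(const string_trie& t, const std::string& prefix)
{
  typedef string_trie::const_iterator const_iterator;
  const std::pair<const_iterator, const_iterator> range =
    t.prefix_range(prefix);
  std::size_t n = 0;
  for (const_iterator it = range.first; it != range.second; ++it)
    ++n;
  return n;
}
```
</programlisting>
<para>The returned pair delimits a contiguous range of the trie, so
the matching keys can be visited in order without scanning the whole
container.</para>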
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.trie.details">
|
||
<info><title>Details</title></info>
|
||
|
||
<section xml:id="container.trie.details.etraits">
|
||
<info><title>Element Access Traits</title></info>
|
||
|
||
<para>A trie inherently views its keys as sequences of elements.
|
||
For example, a trie can view a string as a sequence of
|
||
characters. A trie needs to map each of n elements to a
|
||
number in {0, ..., n - 1}. For example, a trie can map a
|
||
character <varname>c</varname> to
|
||
<programlisting>static_cast<size_t>(c)</programlisting>.</para>
|
||
|
||
<para>Seemingly, then, a trie can assume that its keys support
|
||
(const) iterators, and that the <classname>value_type</classname> of this
|
||
iterator can be cast to a <classname>size_t</classname>. There are several
|
||
reasons, though, to decouple the mechanism by which the trie
|
||
accesses its keys' elements from the trie:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>In some cases, the numerical value of an element is
|
||
inappropriate. Consider a trie storing DNA strings. It is
|
||
logical to use a trie with a fan-out of 5 = 1 + |{'A', 'C',
|
||
'G', 'T'}|. This requires mapping 'T' to 3, though.</para></listitem>
|
||
|
||
<listitem><para>In some cases the keys' iterators are different from what
|
||
is needed. For example, a trie can be used to search for
|
||
common suffixes, by using strings'
|
||
<classname>reverse_iterator</classname>. As another example, a trie mapping
|
||
UNICODE strings would have a huge fan-out if each node would
|
||
branch on a UNICODE character; instead, one can define an
|
||
iterator iterating over 8-bit (or less) groups.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The trie is,
consequently, parametrized by <classname>E_Access_Traits</classname>:
traits which specify how to access the elements of a sequence.
<classname>string_trie_e_access_traits</classname>
is a traits class for strings. Each such traits class defines some
types; for example,</para>
|
||
<programlisting>
|
||
typename E_Access_Traits::const_iterator
|
||
</programlisting>
|
||
|
||
<para>is a const iterator iterating over a key's elements. The
traits class must also define methods for obtaining iterators
to the beginning and the end of a key's element sequence.</para>
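<para>The DNA case above can be sketched without the library. The class below is a hypothetical, traits-like illustration: the names <classname>dna_access_traits_sketch</classname>, <function>e_pos</function> and <function>branch_path</function> are assumptions made for this example and are not claimed to match the interface the trie actually requires.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical traits-like class for DNA keys. The member names are
// illustrative only; they are not the library's required interface.
struct dna_access_traits_sketch
{
  typedef std::string key_type;
  typedef std::string::const_iterator const_iterator;

  // Iterators delimiting the elements of a key.
  static const_iterator begin(const key_type& key) { return key.begin(); }
  static const_iterator end(const key_type& key) { return key.end(); }

  // Map each base to a small index instead of its character code.
  static std::size_t e_pos(char e)
  {
    switch (e)
      {
      case 'A': return 0;
      case 'C': return 1;
      case 'G': return 2;
      default:
        assert(e == 'T'); // only the four bases are expected
        return 3;
      }
  }
};

// The branch indices a trie would follow for a given key.
inline std::vector<std::size_t>
branch_path(const std::string& key)
{
  typedef dna_access_traits_sketch traits;
  std::vector<std::size_t> path;
  for (traits::const_iterator it = traits::begin(key);
       it != traits::end(key); ++it)
    path.push_back(traits::e_pos(*it));
  return path;
}
```

<para>With this mapping, 'T' becomes 3 rather than its character code, so a node only needs a handful of branches.</para>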
|
||
|
||
<para>The graphic below shows a
|
||
(PATRICIA) trie resulting from inserting the words: "I wish
|
||
that I could ever see a poem lovely as a trie" (which,
|
||
unfortunately, does not rhyme).</para>
|
||
|
||
<para>The leaf nodes contain values; each internal node contains
two <classname>typename E_Access_Traits::const_iterator</classname>
objects, indicating the maximal common prefix of all keys in
the sub-tree. For example, the shaded internal node roots a
sub-tree with leaves "a" and "as". The maximal common prefix is
"a". The internal node consequently contains two const
iterators, one pointing to <varname>'a'</varname>, and the other to
<varname>'s'</varname>.</para>
|
||
|
||
<figure>
|
||
<title>A PATRICIA trie</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_pat_trie.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>A PATRICIA trie</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.trie.details.node">
|
||
<info><title>Node Invariants</title></info>
|
||
|
||
<para>Trie-based containers support node invariants, as do
tree-based containers. There are two minor
differences, though, which, unfortunately, prevent them from
sharing the same node-updating policies:</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>A trie's <classname>Node_Update</classname> template-template
|
||
parameter is parametrized by <classname>E_Access_Traits</classname>, while
|
||
a tree's <classname>Node_Update</classname> template-template parameter is
|
||
parametrized by <classname>Cmp_Fn</classname>.</para></listitem>
|
||
|
||
<listitem><para>Tree-based containers store values in all nodes, while
|
||
trie-based containers (at least in this implementation) store
|
||
values in leaves.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The graphic below shows the scheme, as well as some predefined
|
||
policies (which are explained below).</para>
|
||
|
||
<figure>
|
||
<title>A trie and its update policy</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_trie_node_updator_policy_cd.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>A trie and its update policy</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
<para>This library offers the following pre-defined trie node
|
||
updating policies:</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>trie_order_statistics_node_update</classname>
|
||
supports order statistics.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem><para><classname>trie_prefix_search_node_update</classname>
|
||
supports searching for ranges that match a given prefix.</para></listitem>
|
||
|
||
<listitem><para><classname>null_node_update</classname>
|
||
is the null node updater.</para></listitem>
|
||
</orderedlist>
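<para>As a minimal sketch of the prefix-search policy, the code below stores the words of the example sentence in a PATRICIA trie and counts the keys that share a prefix. It assumes the names used by recent libstdc++ headers (<classname>trie_string_access_traits</classname>, <classname>null_type</classname>, <function>prefix_range</function>), which may differ from the names used in this text; <function>fill_with_poem</function> and <function>count_with_prefix</function> are hypothetical helpers.</para>

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>

// A set of strings stored in a PATRICIA trie with prefix-search support.
typedef __gnu_pbds::trie<
  std::string,
  __gnu_pbds::null_type,
  __gnu_pbds::trie_string_access_traits<>,
  __gnu_pbds::pat_trie_tag,
  __gnu_pbds::trie_prefix_search_node_update> prefix_trie;

// Insert the words of the example sentence (hypothetical helper).
inline void
fill_with_poem(prefix_trie& t)
{
  const char* const words[] = {
    "I", "wish", "that", "I", "could", "ever", "see",
    "a", "poem", "lovely", "as", "a", "trie"
  };
  for (std::size_t i = 0; i < sizeof(words) / sizeof(words[0]); ++i)
    t.insert(words[i]); // duplicate keys are stored only once
}

// Count the keys that start with the given prefix.
inline std::size_t
count_with_prefix(const prefix_trie& t, const std::string& prefix)
{
  typedef prefix_trie::const_iterator const_iterator;
  std::pair<const_iterator, const_iterator> range = t.prefix_range(prefix);
  std::size_t n = 0;
  for (const_iterator it = range.first; it != range.second; ++it)
    ++n;
  return n;
}
```

<para>For the sentence above, the keys matching the prefix "a" are "a" and "as".</para>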
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.trie.details.split">
|
||
<info><title>Split and Join</title></info>
|
||
<para>Trie-based containers support split and join methods; the
rationale is the same as that for tree-based containers supporting
these methods.</para>
|
||
</section>
|
||
|
||
</section> <!-- details -->
|
||
|
||
</section> <!-- trie -->
|
||
|
||
<!-- list_update -->
|
||
<section xml:id="pbds.design.container.list">
|
||
<info><title>List</title></info>
|
||
|
||
<section xml:id="container.list.interface">
|
||
<info><title>Interface</title></info>
|
||
|
||
<para>The list-based container has the following declaration:</para>
|
||
<programlisting>
|
||
template<typename Key,
|
||
typename Mapped,
|
||
typename Eq_Fn = std::equal_to<Key>,
|
||
typename Update_Policy = lu_move_to_front_policy<>,
|
||
typename Allocator = std::allocator<char> >
|
||
class list_update;
|
||
</programlisting>
|
||
|
||
<para>The parameters have the following meaning:</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
<classname>Key</classname> is the key type.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>Mapped</classname> is the mapped-policy.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>Eq_Fn</classname> is a key equivalence functor.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>Update_Policy</classname> is a policy updating positions in
|
||
the list based on access patterns. It is described in the
|
||
following subsection.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
<classname>Allocator</classname> is an allocator type.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
|
||
<para>A list-based associative container is a container that
stores elements in a linked list. It does not keep the elements
in any particular order related to the keys. List-based
containers are primarily useful for creating "multimaps". In fact,
list-based containers are designed in this library expressly for
this purpose.</para>
|
||
|
||
<para>List-based containers might also be useful for some rare
|
||
cases, where a key is encapsulated to the extent that only
|
||
key-equivalence can be tested. Hash-based containers need to know
|
||
how to transform a key into a size type, and tree-based containers
|
||
need to know if some key is larger than another. List-based
|
||
associative containers, conversely, only need to know if two keys
|
||
are equivalent.</para>
|
||
|
||
<para>Since a list-based associative container does not order
elements by keys, is it possible to order the list in some
useful manner? Remarkably, many on-line competitive
algorithms exist for reordering lists to reflect access
prediction (see <xref linkend="biblio.motwani95random"/> and <xref linkend="biblio.andrew04mtf"/>).
</para>
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.list.details">
|
||
<info><title>Details</title></info>
|
||
<para>
|
||
</para>
|
||
<section xml:id="container.list.details.ds">
|
||
<info><title>Underlying Data Structure</title></info>
|
||
|
||
<para>The graphic below shows a
|
||
simple list of integer keys. If we search for the integer 6, we
|
||
are paying an overhead: the link with key 6 is only the fifth
|
||
link; if it were the first link, it could be accessed
|
||
faster.</para>
|
||
|
||
<figure>
|
||
<title>A simple list</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_simple_list.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>A simple list</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>List-update algorithms reorder lists as elements are
|
||
accessed. They try to determine, by the access history, which
|
||
keys to move to the front of the list. Some of these algorithms
|
||
require adding some metadata alongside each entry.</para>
|
||
|
||
<para>For example, in the graphic below label A shows the counter
|
||
algorithm. Each node contains both a key and a count metadata
|
||
(shown in bold). When an element is accessed (e.g. 6) its count is
|
||
incremented, as shown in label B. If the count reaches some
|
||
predetermined value, say 10, as shown in label C, the count is set
|
||
to 0 and the node is moved to the front of the list, as in label
|
||
D.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>The counter algorithm</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_list_update.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>The counter algorithm</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.list.details.policies">
|
||
<info><title>Policies</title></info>
|
||
|
||
<para>This library allows instantiating lists with policies
|
||
implementing any algorithm moving nodes to the front of the
|
||
list (policies implementing algorithms interchanging nodes are
|
||
unsupported).</para>
|
||
|
||
<para>Associative containers based on lists are parametrized by a
|
||
<classname>Update_Policy</classname> parameter. This parameter defines the
|
||
type of metadata each node contains, how to create the
|
||
metadata, and how to decide, using this metadata, whether to
|
||
move a node to the front of the list. A list-based associative
|
||
container object derives (publicly) from its update policy.
|
||
</para>
|
||
|
||
<para>An instantiation of <classname>Update_Policy</classname> must define
|
||
internally <classname>update_metadata</classname> as the metadata it
|
||
requires. Internally, each node of the list contains, besides
|
||
the usual key and data, an instance of <classname>typename
|
||
Update_Policy::update_metadata</classname>.</para>
|
||
|
||
<para>An instantiation of <classname>Update_Policy</classname> must define
|
||
internally two operators:</para>
|
||
<programlisting>
|
||
update_metadata
|
||
operator()();
|
||
|
||
bool
|
||
operator()(update_metadata &);
|
||
</programlisting>
|
||
|
||
<para>The first is called by the container object, when creating a
new node, to create the node's metadata. The second is called
by the container object, when a node is accessed (i.e.,
when a find operation's key is equivalent to the key of the
node), to determine whether to move the node to the front of
the list.
</para>
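<para>The two-operator protocol can be illustrated with a self-contained sketch that does not use the library. Both <classname>counter_policy_sketch</classname> and <classname>tiny_lu_list</classname> are hypothetical names invented for this example, and the threshold of 3 (rather than 10) is chosen only to keep the usage short.</para>

```cpp
#include <cstddef>
#include <list>

// Hypothetical policy with the two operators described above.
struct counter_policy_sketch
{
  typedef std::size_t update_metadata;
  enum { max_count = 3 };

  // Called when a node is created: the initial metadata.
  update_metadata operator()() const { return 0; }

  // Called when a node is accessed: true means "move to the front".
  bool operator()(update_metadata& count) const
  {
    if (++count < max_count)
      return false;
    count = 0;
    return true;
  }
};

// Hypothetical list of int keys that derives from its update policy.
template<typename Update_Policy>
class tiny_lu_list : public Update_Policy
{
  struct node
  {
    int key;
    typename Update_Policy::update_metadata metadata;
  };
  std::list<node> nodes_;

public:
  void insert(int key)
  {
    node n = { key, (*this)() }; // the policy creates the metadata
    nodes_.push_back(n);
  }

  bool find(int key)
  {
    typedef typename std::list<node>::iterator iterator;
    for (iterator it = nodes_.begin(); it != nodes_.end(); ++it)
      if (it->key == key)
        {
          if ((*this)(it->metadata)) // the policy decides on the move
            nodes_.splice(nodes_.begin(), nodes_, it);
          return true;
        }
    return false;
  }

  // Precondition: the list is not empty.
  int front() const { return nodes_.front().key; }
};
```

<para>With this sketch, a key moves to the front on every third successful find; a policy that always returns true from the second operator would behave like a move-to-front list.</para>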
|
||
|
||
<para>The library contains two predefined implementations of
list-update policies. The first
is <classname>lu_counter_policy</classname>, which implements the
counter algorithm described above. The second is
<classname>lu_move_to_front_policy</classname>,
which unconditionally moves an accessed element to the front of
the list. The latter type is very useful in this library,
since there is no need to associate metadata with each element
(see <xref linkend="biblio.andrew04mtf"/>).
</para>
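<para>A short sketch of selecting either predefined policy follows. It assumes that <classname>lu_counter_policy</classname> takes the count threshold as its first template argument, as in recent libstdc++ headers; <function>lookup_or</function> is a hypothetical helper.</para>

```cpp
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/list_update_policy.hpp>

// The default update policy (move-to-front).
typedef __gnu_pbds::list_update<int, char> mtf_map;

// The counter policy, assumed here to move a node after 10 accesses.
typedef __gnu_pbds::list_update<
  int, char, std::equal_to<int>,
  __gnu_pbds::lu_counter_policy<10> > counter_map;

// Return the mapped value for key, or fallback if key is absent.
template<typename Map>
char
lookup_or(Map& m, int key, char fallback)
{
  typename Map::point_iterator it = m.find(key);
  return it == m.end() ? fallback : it->second;
}
```

<para>Both types expose the same interface; only the reordering behavior on access differs.</para>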
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.list.details.mapped">
|
||
<info><title>Use in Multimaps</title></info>
|
||
|
||
<para>In this library, there are no equivalents for the standard's
|
||
multimaps and multisets; instead one uses an associative
|
||
container mapping primary keys to secondary keys.</para>
|
||
|
||
<para>List-based containers are especially useful as associative
|
||
containers for secondary keys. In fact, they are implemented
|
||
here expressly for this purpose.</para>
|
||
|
||
<para>To begin with, these containers use very little per-entry
|
||
structure memory overhead, since they can be implemented as
|
||
singly-linked lists. (Arrays use even lower per-entry memory
|
||
overhead, but they are less flexible in moving around entries,
|
||
and have weaker invalidation guarantees).</para>
|
||
|
||
<para>More importantly, though, list-based containers use very
|
||
little per-container memory overhead. The memory overhead of an
|
||
empty list-based container is practically that of a pointer.
|
||
This is important when they are used as secondary
|
||
associative-containers in situations where the average ratio of
|
||
secondary keys to primary keys is low (or even 1).</para>
|
||
|
||
<para>In order to reduce the per-container memory overhead as much
|
||
as possible, they are implemented as closely as possible to
|
||
singly-linked lists.</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
List-based containers do not store internally the number
of values that they hold. This means that their <function>size</function>
method has linear complexity (as that of <classname>std::list</classname>
could have before C++11).
Note that finding the number of equivalent-key values in a
standard multimap also has linear complexity (because it must be
done via <function>std::distance</function> on the result of the
multimap's <function>equal_range</function> method), but usually with
higher constants.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
Most associative-container objects each hold a policy
|
||
object (a hash-based container object holds a
|
||
hash functor). List-based containers, conversely, only have
|
||
class-wide policy objects.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
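<para>The multimap arrangement described in this section can be sketched as a hash table whose mapped type is a list-based set of secondary keys. The typedef and helper names below are hypothetical, and <classname>null_type</classname> is assumed to be the "no mapped data" marker of recent libstdc++ headers.</para>

```cpp
#include <cstddef>
#include <string>
#include <ext/pb_ds/assoc_container.hpp>

// Secondary keys: a list-based set of ints.
typedef __gnu_pbds::list_update<int, __gnu_pbds::null_type> secondary_set;

// Primary keys map to their set of secondary keys.
typedef __gnu_pbds::cc_hash_table<std::string, secondary_set> multimap_sketch;

// Associate a secondary key with a primary key.
inline void
add_pair(multimap_sketch& m, const std::string& primary, int secondary)
{ m[primary].insert(secondary); }

// Number of secondary keys stored under a primary key (linear in that number).
inline std::size_t
count_for(multimap_sketch& m, const std::string& primary)
{
  multimap_sketch::point_iterator it = m.find(primary);
  if (it == m.end())
    return 0;
  return it->second.size();
}
```

<para>Because an empty <classname>secondary_set</classname> is small, this layout stays compact even when most primary keys have a single secondary key.</para>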
|
||
|
||
|
||
</section>
|
||
|
||
</section> <!-- details -->
|
||
|
||
</section> <!-- list -->
|
||
|
||
|
||
<!-- priority_queue -->
|
||
<section xml:id="pbds.design.container.priority_queue">
|
||
<info><title>Priority Queue</title></info>
|
||
|
||
<section xml:id="container.priority_queue.interface">
|
||
<info><title>Interface</title></info>
|
||
|
||
<para>The priority queue container has the following
|
||
declaration:
|
||
</para>
|
||
<programlisting>
|
||
template<typename Value_Type,
|
||
typename Cmp_Fn = std::less<Value_Type>,
|
||
typename Tag = pairing_heap_tag,
|
||
typename Allocator = std::allocator<char > >
|
||
class priority_queue;
|
||
</programlisting>
|
||
|
||
<para>The parameters have the following meaning:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para><classname>Value_Type</classname> is the value type.</para></listitem>
|
||
|
||
<listitem><para><classname>Cmp_Fn</classname> is a value comparison functor</para></listitem>
|
||
|
||
<listitem><para><classname>Tag</classname> specifies which underlying data structure
|
||
to use.</para></listitem>
|
||
|
||
<listitem><para><classname>Allocator</classname> is an allocator
|
||
type.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>The <classname>Tag</classname> parameter specifies which underlying
|
||
data structure to use. Instantiating it by <classname>pairing_heap_tag</classname>, <classname>binary_heap_tag</classname>,
|
||
<classname>binomial_heap_tag</classname>,
|
||
<classname>rc_binomial_heap_tag</classname>,
|
||
or <classname>thin_heap_tag</classname>,
|
||
specifies, respectively,
|
||
an underlying pairing heap (<xref linkend="biblio.fredman86pairing"/>),
|
||
binary heap (<xref linkend="biblio.clrs2001"/>),
|
||
binomial heap (<xref linkend="biblio.clrs2001"/>),
|
||
a binomial heap with a redundant binary counter (<xref linkend="biblio.maverik_lowerbounds"/>),
|
||
or a thin heap (<xref linkend="biblio.kt99fat_heaps"/>).
|
||
</para>
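<para>A minimal sketch of selecting the underlying data structure through the <classname>Tag</classname> parameter follows; the typedef names and <function>largest_of_three</function> are hypothetical. Note that the class is written with its namespace, since an unqualified <classname>priority_queue</classname> could otherwise be confused with the one in <literal>std</literal>.</para>

```cpp
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>

// Default tag: a pairing heap.
typedef __gnu_pbds::priority_queue<int> pairing_pq;

// Explicitly chosen underlying data structures.
typedef __gnu_pbds::priority_queue<
  int, std::less<int>, __gnu_pbds::binary_heap_tag> binary_pq;
typedef __gnu_pbds::priority_queue<
  int, std::less<int>, __gnu_pbds::thin_heap_tag> thin_pq;

// A different comparison functor: the smallest value is on top.
typedef __gnu_pbds::priority_queue<int, std::greater<int> > min_pq;

// Push three values and report what the queue considers "largest".
template<typename Queue>
int
largest_of_three(int a, int b, int c)
{
  Queue q;
  q.push(a);
  q.push(b);
  q.push(c);
  return q.top();
}
```

<para>The calling code is identical for every tag; only the typedef changes.</para>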
|
||
|
||
<para>
|
||
As mentioned in the tutorial,
|
||
<classname>__gnu_pbds::priority_queue</classname> shares most of the
|
||
same interface with <classname>std::priority_queue</classname>.
|
||
E.g. if <varname>q</varname> is a priority queue of type
|
||
<classname>Q</classname>, then <function>q.top()</function> will
|
||
return the "largest" value in the container (according to
|
||
<classname>typename
|
||
Q::cmp_fn</classname>). <classname>__gnu_pbds::priority_queue</classname>
|
||
has a larger (and very slightly different) interface than
|
||
<classname>std::priority_queue</classname>, however, since typically
|
||
<classname>push</classname> and <classname>pop</classname> are deemed
|
||
insufficient for manipulating priority-queues. </para>
|
||
|
||
<para>Different settings require different priority-queue
implementations, which are described later; the section on traits
discusses ways to differentiate between the traits of the
different implementations.</para>
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.priority_queue.details">
|
||
<info><title>Details</title></info>
|
||
|
||
<section xml:id="container.priority_queue.details.iterators">
|
||
<info><title>Iterators</title></info>
|
||
|
||
<para>There are many different underlying data structures for
|
||
implementing priority queues. Unfortunately, most such
|
||
structures are oriented towards making <function>push</function> and
|
||
<function>top</function> efficient, and consequently don't allow efficient
|
||
access of other elements: for instance, they cannot support an efficient
|
||
<function>find</function> method. In the use case where it
|
||
is important to both access and "do something with" an
|
||
arbitrary value, one would be out of luck. For example, many graph algorithms require
|
||
modifying a value (typically increasing it in the sense of the
|
||
priority queue's comparison functor).</para>
|
||
|
||
<para>In order to access and manipulate an arbitrary value in a
|
||
priority queue, one needs to reference the internals of the
|
||
priority queue from some form of an associative container -
|
||
this is unavoidable. Of course, in order to maintain the
|
||
encapsulation of the priority queue, this needs to be done in a
|
||
way that minimizes exposure to implementation internals.</para>
|
||
|
||
<para>In this library the priority queue's <function>push</function>
method returns an iterator, which, if valid, can be used for subsequent <function>modify</function> and
|
||
<function>erase</function> operations. This both preserves the priority
|
||
queue's encapsulation, and allows accessing arbitrary values (since the
|
||
returned iterators from the <function>push</function> operation can be
|
||
stored in some form of associative container).</para>
|
||
|
||
<para>Priority queues' iterators present a problem regarding their
|
||
invalidation guarantees. One assumes that calling
|
||
<function>operator++</function> on an iterator will associate it
|
||
with the "next" value. Priority-queues are
|
||
self-organizing: each operation changes what the "next" value
|
||
means. Consequently, it does not make sense that <function>push</function>
|
||
will return an iterator that can be incremented - this can have
|
||
no possible use. Also, as in the case of hash-based containers,
|
||
it is awkward to define if a subsequent <function>push</function> operation
|
||
invalidates a prior returned iterator: it invalidates it in the
|
||
sense that its "next" value is not related to what it
|
||
previously considered to be its "next" value. However, it might not
|
||
invalidate it, in the sense that it can be
|
||
de-referenced and used for <function>modify</function> and <function>erase</function>
|
||
operations.</para>
|
||
|
||
<para>Similarly to the case of the other unordered associative
|
||
containers, this library uses a distinction between
|
||
point-type and range-type iterators. A priority queue's <classname>iterator</classname> can always be
|
||
converted to a <classname>point_iterator</classname>, and a
|
||
<classname>const_iterator</classname> can always be converted to a
|
||
<classname>point_const_iterator</classname>.</para>
|
||
|
||
<para>The following snippet demonstrates manipulating an arbitrary
|
||
value:</para>
|
||
<programlisting>
// A priority queue of integers (a pairing heap by default).
__gnu_pbds::priority_queue<int> p;

// push returns a point-type iterator to the inserted value; keep it.
__gnu_pbds::priority_queue<int>::point_iterator it = p.push(0);

p.push(1);
p.push(2);

// Now modify the value that the iterator refers to (0 becomes 3).
p.modify(it, 3);

assert(p.top() == 3);
</programlisting>
|
||
|
||
|
||
<para>It should be noted that an alternative design could embed an
associative container in a priority queue. Could, but most
probably should not. To begin with, one
could always encapsulate a priority queue and an associative
container mapping values to priority queue iterators with no
performance loss. One cannot, however, "un-encapsulate" a priority
queue embedding an associative container, which might lead to
performance loss. Assume that one needs to associate each value
with some data unrelated to priority queues. Then, using
this library's design, one could use a single
associative container mapping each value to a pair consisting of
this data and a priority queue's iterator. The embedded
design would need two associative containers. Similar
problems might arise in cases where a value can reside
simultaneously in many priority queues.</para>
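<para>The cross-referencing design can be sketched by keeping the iterators returned from <function>push</function> in a <classname>std::map</classname>. The class <classname>tracked_queue</classname> and its members are hypothetical; the sketch assumes a pairing heap (the default tag), each name being pushed at most once, and no direct <function>pop</function>, which would leave a stale handle behind.</para>

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <ext/pb_ds/priority_queue.hpp>

typedef std::pair<int, std::string> entry;             // (priority, name)
typedef __gnu_pbds::priority_queue<entry> entry_queue; // pairing heap

// Hypothetical wrapper: a priority queue plus a name-to-iterator map.
class tracked_queue
{
  typedef std::map<std::string, entry_queue::point_iterator> handle_map;
  entry_queue queue_;
  handle_map handles_;

public:
  // Assumes each name is pushed at most once.
  void push(const std::string& name, int priority)
  { handles_[name] = queue_.push(entry(priority, name)); }

  // Change the priority of a named entry; false if the name is unknown.
  bool reprioritize(const std::string& name, int priority)
  {
    handle_map::iterator h = handles_.find(name);
    if (h == handles_.end())
      return false;
    queue_.modify(h->second, entry(priority, name));
    return true;
  }

  // Erase a named entry; false if the name is unknown.
  bool remove(const std::string& name)
  {
    handle_map::iterator h = handles_.find(name);
    if (h == handles_.end())
      return false;
    queue_.erase(h->second);
    handles_.erase(h);
    return true;
  }

  // Precondition: the queue is not empty.
  const std::string& top_name() const { return queue_.top().second; }
  std::size_t size() const { return queue_.size(); }
};
```

<para>Extra per-value data could be stored next to the iterator in the same map, so a single associative container serves both purposes.</para>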
|
||
|
||
</section>
|
||
|
||
|
||
<section xml:id="container.priority_queue.details.d">
|
||
<info><title>Underlying Data Structure</title></info>
|
||
|
||
<para>There are three main implementations of priority queues: the
|
||
first employs a binary heap, typically one which uses a
|
||
sequence; the second uses a tree (or forest of trees), which is
|
||
typically less structured than an associative container's tree;
|
||
the third simply uses an associative container. These are
|
||
shown in the graphic below, in labels A1 and A2, label B, and label C.</para>
|
||
|
||
<figure>
|
||
<title>Underlying Priority-Queue Data-Structures.</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_priority_queue_different_underlying_dss.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Underlying Priority-Queue Data-Structures.</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
<para>Roughly speaking, any value that is both pushed and popped
|
||
from a priority queue must incur a logarithmic expense (in the
|
||
amortized sense). Any priority queue implementation that avoided
this would violate known bounds on comparison-based
|
||
sorting (see <xref linkend="biblio.clrs2001"/> and <xref linkend="biblio.brodal96priority"/>).
|
||
</para>
|
||
|
||
<para>Most implementations do
|
||
not differ in the asymptotic amortized complexity of
|
||
<function>push</function> and <function>pop</function> operations, but they differ in
|
||
the constants involved, in the complexity of other operations
|
||
(e.g., <function>modify</function>), and in the worst-case
|
||
complexity of single operations. In general, the more
|
||
"structured" an implementation (i.e., the more internal
|
||
invariants it possesses) - the higher its amortized complexity
|
||
of <function>push</function> and <function>pop</function> operations.</para>
|
||
|
||
<para>This library implements different algorithms using a
|
||
single class: <classname>priority_queue</classname>.
|
||
Instantiating the <classname>Tag</classname> template parameter, "selects"
|
||
the implementation:</para>
|
||
|
||
<orderedlist>
|
||
<listitem><para>
Instantiating <classname>Tag = binary_heap_tag</classname> creates
a binary heap of the form represented in the graphic by labels A1 or A2. The former is internally
selected by priority_queue
if <classname>Value_Type</classname> is instantiated by a primitive type
(e.g., an <type>int</type>); the latter is
internally selected for all other types (e.g.,
<classname>std::string</classname>). This implementation is relatively
unstructured, and so has good <classname>push</classname> and <classname>pop</classname>
performance; it is the "best-in-kind" for primitive
types, e.g., <type>int</type>s. Conversely, it has
poor worst-case performance, and can support only linear-time
<function>modify</function> and <function>erase</function> operations.</para></listitem>
|
||
|
||
<listitem><para>Instantiating <classname>Tag =
pairing_heap_tag</classname> creates a pairing heap of the form
represented by label B in the graphic above. This
implementation too is relatively unstructured, and so has good
<function>push</function> and <function>pop</function>
performance; it is the "best-in-kind" for non-primitive types,
e.g., <classname>std::string</classname>s. It also has very good
worst-case <function>push</function> and
<function>join</function> performance (O(1)), but has high
worst-case <function>pop</function>
complexity.</para></listitem>
|
||
|
||
<listitem><para>Instantiating <classname>Tag =
binomial_heap_tag</classname> creates a binomial heap of the
form represented by label B in the graphic above. This
implementation is more structured than a pairing heap, and so
has worse <function>push</function> and <function>pop</function>
performance. Conversely, it has sub-linear worst-case bounds for,
e.g., <function>pop</function>, and so it might be preferred in
cases where responsiveness is important.</para></listitem>
|
||
|
||
<listitem><para>Instantiating <classname>Tag =
|
||
rc_binomial_heap_tag</classname> creates a binomial heap of the
|
||
form represented in label B above, accompanied by a redundant
|
||
counter which governs the trees. This implementation is
|
||
therefore more structured than a binomial heap, and so has worse
|
||
<function>push</function> and <function>pop</function>
|
||
performance. Conversely, it guarantees O(1)
|
||
<function>push</function> complexity, and so it might be
|
||
preferred in cases where the responsiveness of a binomial heap
|
||
is insufficient.</para></listitem>
|
||
|
||
<listitem><para>Instantiating <classname>Tag =
thin_heap_tag</classname> creates a thin heap of the form
represented by the label B in the graphic above. This
implementation too is more structured than a pairing heap, and
so has worse <function>push</function> and
<function>pop</function> performance. Conversely, it has better
worst-case complexities than a Fibonacci heap, with identical
amortized complexities, and so might be more appropriate for some graph
algorithms.</para></listitem>
|
||
</orderedlist>
|
||
|
||
<para>Of course, one can use any order-preserving associative
|
||
container as a priority queue, as in the graphic above label C, possibly by creating an adapter class
|
||
over the associative container (much as
|
||
<classname>std::priority_queue</classname> can adapt <classname>std::vector</classname>).
|
||
This has the advantage that no cross-referencing is necessary
|
||
at all; the priority queue itself is an associative container.
|
||
However, most associative containers are too structured to compete with
|
||
priority queues in terms of <function>push</function> and <function>pop</function>
|
||
performance.</para>
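<para>As a sketch of the adapter idea, the class below wraps a <classname>std::multiset</classname> so that it can be used as a priority queue of integers. The name <classname>set_backed_queue</classname> and its members are hypothetical; the point is that arbitrary values can be found and erased directly, with no cross-referencing.</para>

```cpp
#include <cstddef>
#include <set>

// Hypothetical adapter: an ordered associative container used as a
// priority queue. The largest value is at the end of the multiset.
class set_backed_queue
{
  std::multiset<int> values_;

public:
  void push(int v) { values_.insert(v); }

  // Precondition: !empty().
  int top() const { return *values_.rbegin(); }

  // Precondition: !empty().
  void pop()
  {
    std::multiset<int>::iterator last = values_.end();
    --last;
    values_.erase(last);
  }

  // Erase one occurrence of an arbitrary value, in logarithmic time.
  bool erase_one(int v)
  {
    std::multiset<int>::iterator it = values_.find(v);
    if (it == values_.end())
      return false;
    values_.erase(it);
    return true;
  }

  bool empty() const { return values_.empty(); }
  std::size_t size() const { return values_.size(); }
};
```

<para>The price is that every <function>push</function> and <function>pop</function> maintains a fully ordered tree, which is more structure than a heap needs.</para>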
|
||
|
||
|
||
|
||
</section>
|
||
|
||
<section xml:id="container.priority_queue.details.traits">
|
||
<info><title>Traits</title></info>
|
||
|
||
<para>It would be nice if all priority queues could
share exactly the same behavior regardless of implementation. Sadly, this is not possible. One example is the join operation: joining
two binary heaps might throw an exception (without corrupting
either of the heaps on which it operates), but joining two pairing
heaps is exception free.</para>
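<para>A minimal sketch of a join on pairing heaps follows; <function>merge_and_top</function> is a hypothetical helper. The sketch relies on <function>join</function> moving all values of the argument into the queue on which it is called.</para>

```cpp
#include <ext/pb_ds/priority_queue.hpp>

typedef __gnu_pbds::priority_queue<int> int_queue; // pairing heap

// Move every value of src into dst and return the new largest value.
// Precondition: at least one of the two queues is not empty.
inline int
merge_and_top(int_queue& dst, int_queue& src)
{
  dst.join(src); // src is left empty
  return dst.top();
}
```

<para>With a binary heap in place of the pairing heap, the same call could throw, which is why generic code may need to query the container's traits first.</para>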
|
||
|
||
<para>Tags and traits are very useful for manipulating generic
|
||
types. <classname>__gnu_pbds::priority_queue</classname>
|
||
publicly defines <classname>container_category</classname> as one of the tags. Given any
|
||
container <classname>Cntnr</classname>, the tag of the underlying
|
||
data structure can be found via <classname>typename
|
||
Cntnr::container_category</classname>; this is one of the possible tags shown in the graphic below.
|
||
</para>
|
||
|
||
<figure>
|
||
<title>Priority-Queue Data-Structure Tags.</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata align="center" format="PNG" scale="100"
|
||
fileref="../images/pbds_priority_queue_tag_hierarchy.png"/>
|
||
</imageobject>
|
||
<textobject>
|
||
<phrase>Priority-Queue Data-Structure Tags.</phrase>
|
||
</textobject>
|
||
</mediaobject>
|
||
</figure>
|
||
|
||
|
||
<para>Additionally, a traits mechanism can be used to query a
|
||
container type for its attributes. Given any container
|
||
<classname>Cntnr</classname>, then <programlisting>__gnu_pbds::container_traits<Cntnr></programlisting>
|
||
is a traits class identifying the properties of the
|
||
container.</para>
|
||
|
||
<para>To find if a container might throw if two of its objects are
|
||
joined, one can use
|
||
<programlisting>
|
||
container_traits<Cntnr>::split_join_can_throw
|
||
</programlisting>
|
||
</para>
|
||
|
||
<para>
|
||
Different priority-queue implementations have different invalidation guarantees. This is
|
||
especially important, since there is no way to access an arbitrary
|
||
value of priority queues except for iterators. Similarly to
|
||
associative containers, one can use
|
||
<programlisting>
|
||
container_traits<Cntnr>::invalidation_guarantee
|
||
</programlisting>
|
||
to get the invalidation guarantee type of a priority queue.</para>
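<para>Both queries can be sketched as follows. The helper names are hypothetical; the sketch assumes that <classname>split_join_can_throw</classname> is an integral constant and that <classname>point_invalidation_guarantee</classname> derives from <classname>basic_invalidation_guarantee</classname>, so that overload resolution on a tag object selects the strongest matching guarantee.</para>

```cpp
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

typedef __gnu_pbds::priority_queue<
  int, std::less<int>, __gnu_pbds::pairing_heap_tag> pairing_pq;
typedef __gnu_pbds::priority_queue<
  int, std::less<int>, __gnu_pbds::binary_heap_tag> binary_pq;

// True if joining two containers of this type might throw.
template<typename Cntnr>
bool
join_may_throw()
{ return __gnu_pbds::container_traits<Cntnr>::split_join_can_throw; }

// Tag dispatch on the invalidation guarantee.
inline bool
keeps_points(__gnu_pbds::basic_invalidation_guarantee)
{ return false; }

inline bool
keeps_points(__gnu_pbds::point_invalidation_guarantee)
{ return true; }

// True if point-type iterators stay valid while the container changes.
template<typename Cntnr>
bool
point_iterators_stay_valid()
{
  typedef typename __gnu_pbds::container_traits<Cntnr>::invalidation_guarantee
    guarantee;
  return keeps_points(guarantee());
}
```

<para>Generic code can use such checks to decide, for example, whether iterators returned from <function>push</function> may safely be stored for later <function>modify</function> calls.</para>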
|
||
|
||
<para>It is easy to understand from the graphic above, what <classname>container_traits<Cntnr>::invalidation_guarantee</classname>
|
||
will be for different implementations. All implementations of
|
||
type represented by label B have <classname>point_invalidation_guarantee</classname>:
|
||
the container can freely internally reorganize the nodes -
|
||
range-type iterators are invalidated, but point-type iterators
|
||
are always valid. Implementations of type represented by labels A1 and A2 have <classname>basic_invalidation_guarantee</classname>:
|
||
the container can freely internally reallocate the array - both
|
||
point-type and range-type iterators might be invalidated.</para>
|
||
|
||
<para>
|
||
This has major implications, and constitutes a good reason to avoid
|
||
using binary heaps. A binary heap can perform <function>modify</function>
|
||
or <function>erase</function> efficiently given a valid point-type
|
||
iterator. However, in order to supply it with a valid point-type
|
||
iterator, one needs to iterate (linearly) over all
|
||
values, then supply the relevant iterator (recall that a
|
||
range-type iterator can always be converted to a point-type
|
||
iterator). This means that if the number of <function>modify</function> or
|
||
<function>erase</function> operations is non-negligible (say
|
||
super-logarithmic in the total sequence of operations) - binary
|
||
heaps will perform badly.
|
||
</para>
|
||
|
||
</section>
|
||
|
||
</section> <!-- details -->
|
||
|
||
</section> <!-- priority_queue -->
|
||
|
||
|
||
|
||
</section> <!-- container -->
|
||
|
||
</section> <!-- design -->
|
||
|
||
|
||
|
||
<!-- S04: Test -->
|
||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" parse="xml"
|
||
href="test_policy_data_structures.xml">
|
||
</xi:include>
|
||
|
||
<!-- S05: Reference/Acknowledgments -->
|
||
<section xml:id="pbds.ack">
|
||
<info><title>Acknowledgments</title></info>
|
||
<?dbhtml filename="policy_data_structures_ack.html"?>
|
||
|
||
<para>
|
||
Written by Ami Tavory and Vladimir Dreizin (IBM Haifa Research
|
||
Laboratories), and Benjamin Kosnik (Red Hat).
|
||
</para>
|
||
|
||
<para>
|
||
This library was partially written at IBM's Haifa Research Labs.
|
||
It is based heavily on policy-based design and uses many useful
|
||
techniques from Modern C++ Design: Generic Programming and Design
|
||
Patterns Applied by Andrei Alexandrescu.
|
||
</para>
|
||
|
||
<para>
|
||
Two ideas are borrowed from the SGI-STL implementation:
|
||
</para>
|
||
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>
|
||
The prime-based resize policies use a list of primes taken from
|
||
the SGI-STL implementation.
|
||
</para>
|
||
</listitem>
|
||
|
||
<listitem>
|
||
<para>
|
||
The red-black trees contain both a root node and a header node
|
||
(containing metadata), connected in a way that forward and
|
||
reverse iteration can be performed efficiently.
|
||
</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
|
||
<para>
|
||
Some test utilities borrow ideas from
|
||
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.boost.org/doc/libs/release/libs/timer/index.html">boost::timer</link>.
|
||
</para>
|
||
|
||
<para>
|
||
We would like to thank Scott Meyers for useful comments (without
|
||
attributing to him any flaws in the design or implementation of the
|
||
library).
|
||
</para>
|
||
<para>We would like to thank Matt Austern for the suggestion to
|
||
include tries.</para>
|
||
</section>
|
||
|
||
<!-- S06: Biblio -->
|
||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" parse="xml"
|
||
href="policy_data_structures_biblio.xml">
|
||
</xi:include>
|
||
|
||
</chapter>
|