codecvt.html: Behind-the-scenes ASCII->HTML tweaks for certain browsers.
2000-08-30 Phil Edwards <pme@sources.redhat.com>

* docs/22_locale/codecvt.html: Behind-the-scenes ASCII->HTML
tweaks for certain browsers.

From-SVN: r36067
parent aef9fbbf89
commit ad82183b0e
@ -1,3 +1,8 @@
+2000-08-30 Phil Edwards <pme@sources.redhat.com>
+
+* docs/22_locale/codecvt.html: Behind-the-scenes ASCII->HTML
+tweaks for certain browsers.
+
 2000-08-28 Benjamin Kosnik <bkoz@purist.soma.redhat.com>
 
 * docs/22_locale/codecvt.html: Add more bits, format.
@ -17,7 +17,7 @@ The standard class codecvt attempts to address conversions between
 different character encoding schemes. In particular, the standard
 attempts to detail conversions between the implementation-defined wide
 characters (hereafter referred to as wchar_t) and the standard type
 char that is so beloved in classic "C" (which can now be referred to
 as narrow characters.) This document attempts to describe how the GNU
 libstdc++-v3 implementation deals with the conversion between wide and
 narrow characters, and also presents a framework for dealing with the
@ -42,7 +42,7 @@ The text around the codecvt definition gives some clues:
 
 <BLOCKQUOTE>
 <I>
 -1- The class codecvt<internT,externT,stateT> is for use when
 converting from one codeset to another, such as from wide characters
 to multibyte characters, between wide character encodings such as
 Unicode and EUC.
@ -68,11 +68,11 @@ Ah ha! Another clue...
 <BLOCKQUOTE>
 <I>
 -3- The instantiations required in the Table ??
 (lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and
 codecvt<char,char,mbstate_t>, convert the implementation-defined
 native character set. codecvt<char,char,mbstate_t> implements a
 degenerate conversion; it does not convert at
 all. codecvt<wchar_t,char,mbstate_t> converts between the native
 character sets for tiny and wide characters. Instantiations on
 mbstate_t perform conversion between encodings known to the library
 implementor. Other encodings can be converted by specializing on a
@ -100,7 +100,7 @@ mcsrtombs and wcsrtombs in particular.</P>
 2. Some thoughts on what would be useful
 </H2>
 Probably the most frequently asked question about code conversion is:
 "So dudes, what's the deal with Unicode strings?" The dude part is
 optional, but apparently the usefulness of Unicode strings is pretty
 widely appreciated. Sadly, this specific encoding (And other useful
 encodings like UTF8, UCS4, ISO 8859-10, etc etc etc) are not mentioned
@ -168,7 +168,8 @@ UTF-16, UTF8, UTF16).
 
 <P>
 For iconv-based implementations, string literals for each of the
-encodings (ie. "UCS-2" and "UTF-8") are necessary, although for other,
+encodings (ie. "UCS-2" and "UTF-8") are necessary,
+although for other,
 non-iconv implementations a table of enumerated values or some other
 mechanism may be required.
 
@ -178,13 +179,13 @@ mechanism may be required.
 <LI>
 Some encodings are require explicit endian-ness. As such, some kind
 of endian marker or other byte-order marker will be necessary. See
 "Footnotes for C/C++ developers" in Haible for more information on
 UCS-2/Unicode endian issues. (Summary: big endian seems most likely,
 however implementations, most notably Microsoft, vary.)
 
 <LI>
 Types representing the conversion state, for conversions involving
 the machinery in the "C" library, or the conversion descriptor, for
 conversions using iconv (such as the type iconv_t.) Note that the
 conversion descriptor encodes more information than a simple encoding
 state type.
@ -207,14 +208,14 @@ mechanism may be required.
 
 <P>
 <H2>
-3. Problems with "C" code conversions : thread safety, global locales,
-termination.
+3. Problems with "C" code conversions : thread safety, global
+locales, termination.
 </H2>
 
 In addition, multi-threaded and multi-locale environments also impact
 the design and requirements for code conversions. In particular, they
 affect the required specialization codecvt<wchar_t, char, mbstate_t>
 when implemented using standard "C" functions.
 
 <P>
 Three problems arise, one big, one of medium importance, and one small.
@ -233,7 +234,7 @@ incorrect. Yikes!
 
 <P>
 The last, and fundamental problem, is the assumption of a global
 locale for all the "C" functions referenced above. For something like
 C++ iostreams (where codecvt is explicitly used) the notion of
 multiple locales is fundamental. In practice, most users may not run
 into this limitation. However, as a quality of implementation issue,
@ -243,7 +244,7 @@ correct results. In short, libstdc++-v3 is trying to offer, as an
 option, a high-quality implementation, damn the additional complexity!
 
 <P>
 For the required specialization codecvt<wchar_t, char, mbstate_t> ,
 conversions are made between the internal character set (always UCS4
 on GNU/Linux) and whatever the currently selected locale for the
 LC_CTYPE category implements.
@ -256,7 +257,7 @@ The two required specializations are implemented as follows:
 
 <P>
 <TT>
 codecvt<char, char, mbstate_t>
 </TT>
 <P>
 This is a degenerate (ie, does nothing) specialization. Implementing
@ -264,7 +265,7 @@ this was a piece of cake.
 
 <P>
 <TT>
 codecvt<char, wchar_t, mbstate_t>
 </TT>
 <P>
 This specialization, by specifying all the template parameters, pretty
@ -353,7 +354,7 @@ ready to convert and will return true.
 
 <P>
 <TT>
 __enc_traits(const __enc_traits&)
 </TT>
 <P>
 As iconv allocates memory and sets up conversion descriptors, the copy
@ -363,8 +364,8 @@ themselves.
 
 <P>
 Definitions for all the required codecvt member functions are provided
 for this specialization, and usage of codecvt<internal character type,
 external character type, __enc_traits> is consistent with other
 codecvt usage.
 
 <P>
@ -379,7 +380,7 @@ a. conversions involving string literals
 typedef unicode_t int_type;
 typedef char ext_type;
 typedef __enc_traits enc_type;
 typedef codecvt<int_type, ext_type, enc_type> unicode_codecvt;
 
 const ext_type* e_lit = "black pearl jasmine tea";
 int size = strlen(e_lit);
@ -399,8 +400,8 @@ a. conversions involving string literals
 // construct a locale object with the specialized facet.
 locale loc(locale::classic(), new unicode_codecvt);
 // sanity check the constructed locale has the specialized facet.
 VERIFY( has_facet<unicode_codecvt>(loc) );
 const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc);
 // convert between const char* and unicode strings
 unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
 initialize_state(state01);
@ -454,7 +455,8 @@ codecvt_wchar_t_char.cc
 standards-conformant manner?
 
 <LI>
-how to synchronize the "C" and "C++" conversion information?
+how to synchronize the "C" and "C++"
+conversion information?
 
 <LI>
 wchar_t/char internal buffers and conversions between
@ -475,17 +477,17 @@ specialization hints, language clarification, and wchar_t fixes.
 8. Bibliography / Referenced Documents
 </H2>
 
 Drepper, Ulrich, GNU libc (glibc) 2.2 manual. In particular, Chapters "6. Character Set Handling" and "7 Locales and Internationalization"
 
 <P>
 Drepper, Ulrich, Numerous, late-night email correspondence
 
 <P>
 Feather, Clive, "A brief description of Normative Addendum 1," in particular the parts on Extended Character Sets
 http://www.lysator.liu.se/c/na1.html
 
 <P>
 Haible, Bruno, "The Unicode HOWTO" v0.18, 4 August 2000
 ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
 
 <P>
@ -495,7 +497,7 @@ ISO/IEC 14882:1998 Programming languages - C++
 ISO/IEC 9899:1999 Programming languages - C
 
 <P>
 Khun, Markus, "UTF-8 and Unicode FAQ for Unix/Linux"
 http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
 <P>