1410e233be
2000-10-25 Bruno Haible <haible@clisp.cons.org> * manual/charset.texi: Fix spelling of __GCONV_FULL_OUTPUT. * manual/message.texi (Translation with gettext): Remove paragraph about macros contained in libintl.h. (bind_textdomain_codeset): Describe codeset argument. (Using gettextized software): Add setlocale call to sample code.
2891 lines
121 KiB
Plaintext
@node Character Set Handling, Locales, String and Array Utilities, Top
|
|
@c %MENU% Support for extended character sets
|
|
@chapter Character Set Handling
|
|
|
|
@ifnottex
|
|
@macro cal{text}
|
|
\text\
|
|
@end macro
|
|
@end ifnottex
|
|
|
|
Character sets used in the early days of computing had only six, seven,
|
|
or eight bits for each character: there was never a case where more than
|
|
eight bits (one byte) were used to represent a single character. The
|
|
limitations of this approach became more apparent as more people
|
|
grappled with non-Roman character sets, where not all the characters
|
|
that make up a language's character set can be represented by @math{2^8}
|
|
choices. This chapter shows the functionality which was added to the C
|
|
library to support multiple character sets.
|
|
|
|
@menu
|
|
* Extended Char Intro:: Introduction to Extended Characters.
|
|
* Charset Function Overview:: Overview about Character Handling
|
|
Functions.
|
|
* Restartable multibyte conversion:: Restartable multibyte conversion
|
|
Functions.
|
|
* Non-reentrant Conversion:: Non-reentrant Conversion Function.
|
|
* Generic Charset Conversion:: Generic Charset Conversion.
|
|
@end menu
|
|
|
|
|
|
@node Extended Char Intro
|
|
@section Introduction to Extended Characters
|
|
|
|
A variety of solutions to overcome the differences between
|
|
character sets with a 1:1 relation between bytes and characters and
|
|
character sets with ratios of 2:1 or 4:1 exist. The remainder of this
|
|
section gives a few examples to help understand the design decisions
|
|
made while developing the functionality of the @w{C library}.
|
|
|
|
@cindex internal representation
|
|
A distinction we have to make right away is between internal and
external representation. @dfn{Internal representation} means the
representation used by a program while keeping the text in memory.
External representations are used when text is stored or transmitted
through some communication channel. Examples of external
representations include files waiting in a directory to be read and
parsed.
|
|
|
|
Traditionally there has been no difference between the two representations.
|
|
It was equally comfortable and useful to use the same single-byte
|
|
representation internally and externally. This changes with more and
|
|
larger character sets.
|
|
|
|
One of the problems to overcome with the internal representation is
|
|
handling text that is externally encoded using different character
|
|
sets. Assume a program which reads two texts and compares them using
|
|
some metric. The comparison can be usefully done only if the texts are
|
|
internally kept in a common format.
|
|
|
|
@cindex wide character
|
|
For such a common format (@math{=} character set) eight bits are certainly
no longer enough. So the smallest entity will have to grow: @dfn{wide
characters} will now be used. Instead of one byte per character, two or
four will be used. (Three bytes are awkward to address in memory and
more than four bytes do not seem to be necessary.)
|
|
|
|
@cindex Unicode
|
|
@cindex ISO 10646
|
|
As shown in some other part of this manual,
|
|
@c !!! Ahem, wide char string functions are not yet covered -- drepper
|
|
there exists a completely new family of functions which can handle texts
|
|
of this kind in memory. The most commonly used character sets for such
|
|
internal wide character representations are Unicode and @w{ISO 10646}
|
|
(also known as UCS for Universal Character Set). Unicode was originally
|
|
planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
|
|
be a 31-bit large code space. The two standards are practically identical.
|
|
They have the same character repertoire and code table, but Unicode specifies
|
|
added semantics. At the moment, only characters in the first @code{0x10000}
|
|
code positions (the so-called Basic Multilingual Plane, BMP) have been
|
|
assigned, but the assignment of more specialized characters outside this
|
|
16-bit space is already in progress. A number of encodings have been
|
|
defined for Unicode and @w{ISO 10646} characters:
|
|
@cindex UCS-2
|
|
@cindex UCS-4
|
|
@cindex UTF-8
|
|
@cindex UTF-16
|
|
UCS-2 is a 16-bit word that can only represent characters
from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
ASCII characters are represented by ASCII bytes and non-ASCII characters
by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
of UCS-2 in which pairs of certain UCS-2 words can be used to encode
non-BMP characters up to @code{0x10ffff}.
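
The UTF-8 rule just described can be sketched in a few lines of C. The
function name @code{utf8_encode} is hypothetical and not part of the C
library. The sketch follows the original 31-bit definition with
sequences of up to six bytes; note that the current Unicode standard
restricts UTF-8 to code points up to @code{0x10ffff} and therefore to
at most four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the code point CP (at most 31 bits) as
   UTF-8 into OUT, which must have room for 6 bytes.  Returns the
   number of bytes written, or 0 if CP does not fit into 31 bits.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[6])
{
  size_t len;
  size_t i;
  unsigned char lead;

  assert (out != NULL);
  if (cp < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  else if (cp < 0x800)
    { len = 2; lead = 0xC0; }
  else if (cp < 0x10000)
    { len = 3; lead = 0xE0; }
  else if (cp < 0x200000)
    { len = 4; lead = 0xF0; }
  else if (cp < 0x4000000)
    { len = 5; lead = 0xF8; }
  else if (cp < 0x80000000u)
    { len = 6; lead = 0xFC; }
  else
    return 0;

  /* Fill the continuation bytes (10xxxxxx) from the end, six bits each.  */
  for (i = len - 1; i > 0; i--)
    {
      out[i] = (unsigned char) (0x80 | (cp & 0x3F));
      cp >>= 6;
    }
  /* The remaining high bits go into the lead byte.  */
  out[0] = (unsigned char) (lead | cp);
  return len;
}
```

Every byte of a multi-byte sequence has its high bit set, which is why
ASCII bytes in a UTF-8 text always stand for ASCII characters.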
|
|
|
|
To represent wide characters the @code{char} type is not suitable. For
|
|
this reason the @w{ISO C} standard introduces a new type which is
|
|
designed to keep one character of a wide character string. To maintain
|
|
the similarity there is also a type corresponding to @code{int} for
|
|
those functions which take a single wide character.
|
|
|
|
@comment stddef.h
|
|
@comment ISO
|
|
@deftp {Data type} wchar_t
|
|
This data type is used as the base type for wide character strings.
|
|
I.e., arrays of objects of this type are the equivalent of @code{char[]}
|
|
for multibyte character strings. The type is defined in @file{stddef.h}.
|
|
|
|
The @w{ISO C90} standard, where this type was introduced, does not say
|
|
anything specific about the representation. It only requires that this
|
|
type is capable of storing all elements of the basic character set.
|
|
Therefore it would be legitimate to define @code{wchar_t} as
|
|
@code{char}. This might make sense for embedded systems.
|
|
|
|
But for GNU systems this type is always 32 bits wide. It is therefore
|
|
capable of representing all UCS-4 values and therefore covering all of
|
|
@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type and
|
|
thereby follow Unicode very strictly. This is perfectly fine with the
|
|
standard but it also means that to represent all characters from Unicode
|
|
and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in
|
|
fact a multi-wide-character encoding. But this contradicts the purpose
|
|
of the @code{wchar_t} type.
|
|
@end deftp
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftp {Data type} wint_t
|
|
@code{wint_t} is a data type used for parameters and variables which
contain a single wide character. As the name suggests it is the
equivalent of @code{int} when using the normal @code{char} strings. The
types @code{wchar_t} and @code{wint_t} often have the same
representation if their size is 32 bits, but if @code{wchar_t} is
defined as @code{char} the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.
|
|
|
|
@pindex wchar.h
|
|
This type is defined in @file{wchar.h} and got introduced in
|
|
@w{Amendment 1} to @w{ISO C90}.
|
|
@end deftp
|
|
|
|
As for the @code{char} data type, there are macros
specifying the minimum and maximum value representable in an object of
type @code{wchar_t}.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypevr Macro wint_t WCHAR_MIN
|
|
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wchar_t}.
|
|
|
|
This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
|
|
@end deftypevr
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypevr Macro wint_t WCHAR_MAX
|
|
The macro @code{WCHAR_MAX} evaluates to the maximum value representable
by an object of type @code{wchar_t}.
|
|
|
|
This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
|
|
@end deftypevr
|
|
|
|
Another special wide character value is the equivalent to @code{EOF}.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypevr Macro wint_t WEOF
|
|
The macro @code{WEOF} evaluates to a constant expression of type
|
|
@code{wint_t} whose value is different from any member of the extended
|
|
character set.
|
|
|
|
@code{WEOF} need not be the same value as @code{EOF} and unlike
|
|
@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
|
|
|
|
@smallexample
@{
  int c;
  ...
  while ((c = getc (fp)) >= 0)
    ...
@}
@end smallexample
|
|
|
|
@noindent
|
|
has to be rewritten to explicitly use @code{WEOF} when wide characters
|
|
are used.
|
|
|
|
@smallexample
@{
  wint_t c;
  ...
  while ((c = getwc (fp)) != WEOF)
    ...
@}
@end smallexample
|
|
|
|
@pindex wchar.h
|
|
This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
|
|
defined in @file{wchar.h}.
|
|
@end deftypevr
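
As a minimal sketch of a complete read loop that compares against
@code{WEOF}, consider the following function. The name
@code{count_wide_chars} is hypothetical; the sketch assumes that the
stream is wide-oriented or has no orientation yet.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters that can be read
   from FP before end of file or a read error occurs.  */
size_t
count_wide_chars (FILE *fp)
{
  wint_t c;          /* wint_t, not wchar_t: it must be able to hold WEOF.  */
  size_t count = 0;

  assert (fp != NULL);
  /* Compare explicitly with WEOF; WEOF need not be negative.  */
  while ((c = getwc (fp)) != WEOF)
    count++;
  return count;
}
```

Declaring @code{c} as @code{wchar_t} instead of @code{wint_t} would be
a mistake for the same reason that storing the result of @code{getc} in
a @code{char} is.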
|
|
|
|
|
|
These internal representations present problems when it comes to storage
and transmission. Since a single wide character consists of more
than one byte, it is affected by byte ordering. I.e., machines with
different endianness would see different values when accessing the same
data. This also applies to communication protocols, which are all
byte-based: the sender has to decide how to split the wide
character into bytes. A last (but not least important) point is that wide
characters often require more storage space than a customized
byte-oriented character set.
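
The byte-ordering problem can be illustrated with a small sketch. The
two functions below (hypothetical names, not part of the C library)
fix the external byte order of a UCS-4 value to big endian, so that
sender and receiver agree regardless of the endianness of their
machines. Big endian is only one possible choice here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store the UCS-4 value WC in OUT as four bytes,
   most significant byte first.  */
void
ucs4_put_be (uint32_t wc, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) ((wc >> 24) & 0xFF);
  out[1] = (unsigned char) ((wc >> 16) & 0xFF);
  out[2] = (unsigned char) ((wc >> 8) & 0xFF);
  out[3] = (unsigned char) (wc & 0xFF);
}

/* Hypothetical helper: read back a value written by ucs4_put_be.  */
uint32_t
ucs4_get_be (const unsigned char in[4])
{
  assert (in != NULL);
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Writing the object representation of a @code{wchar_t} directly, for
example with @code{fwrite}, would instead produce whatever byte order
the writing machine happens to use.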
|
|
|
|
@cindex multibyte character
|
|
@cindex EBCDIC
|
|
For all the above reasons, an external encoding which is different
from the internal encoding is often used if the latter is UCS-2 or UCS-4.
The external encoding is byte-based and can be chosen appropriately for
the environment and for the texts to be handled. There exist a variety
of different character sets which can be used for this external
encoding. They will not be presented exhaustively here; instead, a
description of the major groups will suffice. All of
the ASCII-based character sets fulfill one requirement: they are
``filesystem safe''. This means that the character @code{'/'} is used in
the encoding @emph{only} to represent itself. Things are a bit
different for character sets like EBCDIC (Extended Binary Coded Decimal
Interchange Code, a character set family used by IBM), but if the
operating system does not understand EBCDIC directly the parameters to
system calls have to be converted first anyhow.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The simplest character sets are single-byte character sets. There can be
|
|
only up to 256 characters (for @w{8 bit} character sets) which is not
|
|
sufficient to cover all languages but might be sufficient to handle a
|
|
specific text. Another reason to choose this is because of constraints
|
|
from interaction with other programs (which might not be 8-bit clean).
|
|
|
|
@cindex ISO 2022
|
|
@item
|
|
The @w{ISO 2022} standard defines a mechanism for extended character
|
|
sets where one character @emph{can} be represented by more than one
|
|
byte. This is achieved by associating a state with the text. Embedded
|
|
in the text can be characters which can be used to change the state.
|
|
Each byte in the text might have a different interpretation in each
|
|
state. The state might even influence whether a given byte stands for a
|
|
character on its own or whether it has to be combined with some more
|
|
bytes.
|
|
|
|
@cindex EUC
|
|
@cindex SJIS
|
|
In most uses of @w{ISO 2022} the defined character sets do not allow
state changes which cover more than the next character. This has the
big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly. Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operating systems: EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
and SJIS (Shift-JIS, a Japanese encoding).

But there are also character sets using a state which is valid for more
than one character and has to be changed by another byte sequence.
Examples of this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
|
|
|
|
@item
|
|
@cindex ISO 6937
|
|
Early attempts to fix 8 bit character sets for other languages using the
Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
representing characters like the acute accent do not produce output
themselves: one has to combine them with other characters to get the
desired result. E.g., the byte sequence @code{0xc2 0x61} (non-spacing
acute accent, followed by lower-case `a') produces the ``small a with
acute'' character. To get the acute accent character on its own one has
to write @code{0xc2 0x20} (the non-spacing acute followed by a space).

This type of character set is used in some embedded systems such as
teletex.
|
|
|
|
@item
|
|
@cindex UTF-8
|
|
Instead of converting the Unicode or @w{ISO 10646} text used internally,
it is often sufficient to simply use an encoding different from
UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8. This encoding is able to represent all 31 bits of
@w{ISO 10646} in a byte string of length one to six.
|
|
|
|
@cindex UTF-7
|
|
There were a few other attempts to encode @w{ISO 10646} such as UTF-7
|
|
but UTF-8 is today the only encoding which should be used. In fact,
|
|
UTF-8 will hopefully soon be the only external encoding that has to be
|
|
supported. It proves to be universally usable and the only disadvantage
|
|
is that it favors Roman languages by making the byte string
|
|
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
|
|
than necessary if using a specific character set for these scripts.
|
|
Methods like the Unicode compression scheme can alleviate these
|
|
problems.
|
|
@end itemize
|
|
|
|
The remaining question is how to select the character set or encoding
to use. The answer: you cannot decide this yourself; it is decided
by the developers of the system or the majority of the users. Since the
goal is interoperability one has to use whatever the other people one
works with use. If there are no constraints the selection is based on
the requirements the expected circle of users will have. For example, if a
project is expected to be used only in, say, Russia it is fine to use
KOI8-R or a similar character set. But if at the same time people from,
say, Greece are participating one should use a character set which allows
all people to collaborate.
|
|
|
|
The most widely useful solution seems to be: go with the most general
|
|
character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
|
|
and problems about users not being able to use their own language
|
|
adequately are a thing of the past.
|
|
|
|
One final comment about the choice of the wide character representation
is necessary at this point. We have said above that the natural choice
is using Unicode or @w{ISO 10646}. This is not required, but at least
encouraged, by the @w{ISO C} standard. The standard defines a
macro @code{__STDC_ISO_10646__} that is only defined on systems where
the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
symbol is not defined one should avoid making assumptions about the
wide character representation as much as possible. If the programmer
uses only the functions provided by the C library to handle wide
character strings there should not be any compatibility problems with
other systems.
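
A program that wants to rely on @w{ISO 10646} values in @code{wchar_t}
can test the macro at compile time. The following is a minimal sketch;
the function names are hypothetical, and the euro sign at code position
@code{0x20AC} merely serves as an example value.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if wchar_t is known to hold ISO 10646
   values on this system.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  return 0;
#endif
}

/* Hypothetical helper: return the wide character for the euro sign,
   or WEOF if the wide character encoding is not known.  */
wint_t
euro_sign (void)
{
#ifdef __STDC_ISO_10646__
  /* The numeric value may only be used because the macro is defined.  */
  assert (WCHAR_MAX >= 0x20AC);
  return (wint_t) 0x20AC;
#else
  return WEOF;
#endif
}
```

Code that cannot fall back gracefully can instead reject the platform
with an @code{#error} directive in the @code{#else} branch.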
|
|
|
|
@node Charset Function Overview
|
|
@section Overview about Character Handling Functions
|
|
|
|
A Unix @w{C library} contains three different sets of functions in two
families to handle character set conversion. One of the function
families is specified in the @w{ISO C} standard and is therefore
portable even beyond the Unix world.
|
|
|
|
The most commonly known set of functions, coming from the @w{ISO C90}
|
|
standard, is unfortunately the least useful one. In fact, these
|
|
functions should be avoided whenever possible, especially when
|
|
developing libraries (as opposed to applications).
|
|
|
|
The second family of functions got introduced in the early Unix standards
|
|
(XPG2) and is still part of the latest and greatest Unix standard:
|
|
@w{Unix 98}. It is also the most powerful and useful set of functions.
|
|
But we will start with the functions defined in @w{Amendment 1} to
|
|
@w{ISO C90}.
|
|
|
|
@node Restartable multibyte conversion
|
|
@section Restartable Multibyte Conversion Functions
|
|
|
|
The @w{ISO C} standard defines functions to convert strings from a
|
|
multibyte representation to wide character strings. There are a number
|
|
of peculiarities:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The character set assumed for the multibyte encoding is not specified
|
|
as an argument to the functions. Instead the character set specified by
|
|
the @code{LC_CTYPE} category of the current locale is used; see
|
|
@ref{Locale Categories}.
|
|
|
|
@item
|
|
The functions handling more than one character at a time require
NUL-terminated strings as the argument. I.e., converting blocks of text
does not work unless one can add a NUL byte at an appropriate place.
The GNU C library contains some extensions to the standard which allow
specifying a size, but basically they also expect terminated strings.
|
|
@end itemize
|
|
|
|
Despite these limitations the @w{ISO C} functions can very well be used
|
|
in many contexts. In graphical user interfaces, for instance, it is not
|
|
uncommon to have functions which require text to be displayed in a wide
|
|
character string if it is not simple ASCII. The text itself might come
|
|
from a file with translations and the user should decide about the
|
|
current locale which determines the translation and therefore also the
|
|
external encoding used. In such a situation (and many others) the
|
|
functions described here are perfect. If more freedom while performing
|
|
the conversion is necessary take a look at the @code{iconv} functions
|
|
(@pxref{Generic Charset Conversion}).
|
|
|
|
@menu
|
|
* Selecting the Conversion:: Selecting the conversion and its properties.
|
|
* Keeping the state:: Representing the state of the conversion.
|
|
* Converting a Character:: Converting Single Characters.
|
|
* Converting Strings:: Converting Multibyte and Wide Character
|
|
Strings.
|
|
* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
|
|
@end menu
|
|
|
|
@node Selecting the Conversion
|
|
@subsection Selecting the conversion and its properties
|
|
|
|
We already said above that the currently selected locale for the
@code{LC_CTYPE} category determines the conversion which is performed
by the functions we are about to describe. Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding. The wide character
set is always UCS-4, at least on GNU systems.
|
|
|
|
A characteristic of each multibyte character set is the maximum number
of bytes which can be necessary to represent one character. This
information is quite important when writing code which uses the
conversion functions, as the examples below show.
The @w{ISO C} standard defines two macros which provide this information.
|
|
|
|
|
|
@comment limits.h
|
|
@comment ISO
|
|
@deftypevr Macro int MB_LEN_MAX
|
|
This macro specifies the maximum number of bytes in the multibyte
|
|
sequence for a single character in any of the supported locales. It is
|
|
a compile-time constant and it is defined in @file{limits.h}.
|
|
@pindex limits.h
|
|
@end deftypevr
|
|
|
|
@comment stdlib.h
|
|
@comment ISO
|
|
@deftypevr Macro int MB_CUR_MAX
|
|
@code{MB_CUR_MAX} expands into a positive integer expression that is the
|
|
maximum number of bytes in a multibyte character in the current locale.
|
|
The value is never greater than @code{MB_LEN_MAX}. Unlike
|
|
@code{MB_LEN_MAX} this macro need not be a compile-time constant and in
|
|
fact, in the GNU C library it is not.
|
|
|
|
@pindex stdlib.h
|
|
@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
|
|
@end deftypevr
|
|
|
|
Two different macros are necessary since strictly @w{ISO C90} compilers
|
|
do not allow variable length array definitions but still it is desirable
|
|
to avoid dynamic allocation. This incomplete piece of code shows the
|
|
problem:
|
|
|
|
@smallexample
@{
  char buf[MB_LEN_MAX];
  size_t len = 0;

  while (! feof (fp))
    @{
      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      /* @r{... process} buf @r{and set} used @r{to the number of bytes consumed} */
      memmove (buf, &buf[used], len - used);
      len -= used;
    @}
@}
@end smallexample
|
|
|
|
The code in the inner loop is expected to have always enough bytes in
|
|
the array @var{buf} to convert one multibyte character. The array
|
|
@var{buf} has to be sized statically since many compilers do not allow a
|
|
variable size. The @code{fread} call makes sure that always
|
|
@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it isn't
|
|
a problem if @code{MB_CUR_MAX} is not a compile-time constant.
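
The buffering scheme can be spelled out as a complete function. This
is only a sketch under simplifying assumptions: the name
@code{count_mb_chars} is hypothetical, a NUL character is assumed to be
encoded as a single byte, and an incomplete character (return value
@code{(size_t) -2} of @code{mbrtowc}, which is described later in this
chapter) is treated as an error even though redundant shift sequences
could legitimately cause it.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in the stream FP
   according to the current LC_CTYPE locale.  Returns (size_t) -1 on
   invalid or incomplete input.  */
size_t
count_mb_chars (FILE *fp)
{
  char buf[MB_LEN_MAX];   /* Statically sized; MB_CUR_MAX <= MB_LEN_MAX.  */
  size_t len = 0;         /* Number of valid bytes currently in buf.  */
  size_t count = 0;
  mbstate_t state;

  assert (fp != NULL);
  memset (&state, '\0', sizeof (state));
  for (;;)
    {
      size_t want = MB_CUR_MAX;   /* Evaluated at run time.  */
      size_t used;
      wchar_t wc;

      /* Top up the buffer so that up to MB_CUR_MAX bytes are available.  */
      if (len < want)
        len += fread (buf + len, 1, want - len, fp);
      if (len == 0)
        break;                    /* End of input (or read error).  */

      used = mbrtowc (&wc, buf, len, &state);
      if (used == (size_t) -1 || used == (size_t) -2)
        return (size_t) -1;
      if (used == 0)
        used = 1;                 /* Assumption: NUL occupies one byte.  */
      count++;

      /* Keep the unconsumed bytes for the next round.  */
      memmove (buf, buf + used, len - used);
      len -= used;
    }
  return count;
}
```

The array is sized with the compile-time constant @code{MB_LEN_MAX}
while the amount actually requested from @code{fread} is bounded by the
run-time value @code{MB_CUR_MAX}.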
|
|
|
|
|
|
@node Keeping the state
|
|
@subsection Representing the state of the conversion
|
|
|
|
@cindex stateful
|
|
In the introduction of this chapter it was said that certain character
|
|
sets use a @dfn{stateful} encoding. I.e., the encoded values depend in
|
|
some way on the previous bytes in the text.
|
|
|
|
Since the conversion functions allow converting a text in more than one
|
|
step we must have a way to pass this information from one call of the
|
|
functions to another.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftp {Data type} mbstate_t
|
|
@cindex shift state
|
|
A variable of type @code{mbstate_t} can contain all the information
|
|
about the @dfn{shift state} needed from one call to a conversion
|
|
function to another.
|
|
|
|
@pindex wchar.h
|
|
This type is defined in @file{wchar.h}. It got introduced in
|
|
@w{Amendment 1} to @w{ISO C90}.
|
|
@end deftp
|
|
|
|
To use objects of this type the programmer has to define such objects
|
|
(normally as local variables on the stack) and pass a pointer to the
|
|
object to the conversion functions. This way the conversion function
|
|
can update the object if the current multibyte character set is
|
|
stateful.
|
|
|
|
There is no specific function or initializer to put the state object in
|
|
any specific state. The rules are that the object should always
|
|
represent the initial state before the first use and this is achieved by
|
|
clearing the whole variable with code such as follows:
|
|
|
|
@smallexample
|
|
@{
|
|
mbstate_t state;
|
|
memset (&state, '\0', sizeof (state));
|
|
/* @r{from now on @var{state} can be used.} */
|
|
...
|
|
@}
|
|
@end smallexample
|
|
|
|
When using the conversion functions to generate output it is often
|
|
necessary to test whether the current state corresponds to the initial
|
|
state. This is necessary, for example, to decide whether or not to emit
|
|
escape sequences to set the state to the initial state at certain
|
|
sequence points. Communication protocols often require this.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun int mbsinit (const mbstate_t *@var{ps})
|
|
This function determines whether the state object pointed to by @var{ps}
|
|
is in the initial state or not. If @var{ps} is a null pointer or the
|
|
object is in the initial state the return value is nonzero. Otherwise
|
|
it is zero.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and
|
|
is declared in @file{wchar.h}.
|
|
@end deftypefun
|
|
|
|
Code using this function often looks similar to this:
|
|
|
|
@c Fix the example to explicitly say how to generate the escape sequence
|
|
@c to restore the initial state.
|
|
@smallexample
|
|
@{
|
|
mbstate_t state;
|
|
memset (&state, '\0', sizeof (state));
|
|
/* @r{Use @var{state}.} */
|
|
...
|
|
if (! mbsinit (&state))
|
|
@{
|
|
/* @r{Emit code to return to initial state.} */
|
|
const wchar_t empty[] = L"";
|
|
const wchar_t *srcp = empty;
|
|
wcsrtombs (outbuf, &srcp, outbuflen, &state);
|
|
@}
|
|
...
|
|
@}
|
|
@end smallexample
|
|
|
|
The code to emit the escape sequence to get back to the initial state is
interesting. The @code{wcsrtombs} function can be used to determine the
necessary output code (@pxref{Converting Strings}). Please note that on
GNU systems it is not necessary to perform this extra action for the
conversion from multibyte text to wide character text since the wide
character encoding is not stateful. But nothing in
any standard prohibits @code{wchar_t} from using a stateful
encoding.
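
The fragment above can be wrapped into a function with explicit error
handling. This is a sketch only; the name @code{emit_initial_state} and
its return value convention are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: if *STATE is not the initial shift state, store
   the byte sequence that returns to it in OUTBUF (at most OUTBUFLEN
   bytes) and return its length, not counting the terminating NUL
   byte.  Returns 0 if nothing has to be emitted and (size_t) -1 on
   error or if OUTBUF is too small.  */
size_t
emit_initial_state (char *outbuf, size_t outbuflen, mbstate_t *state)
{
  const wchar_t empty[] = L"";
  const wchar_t *srcp = empty;
  size_t n;

  assert (outbuf != NULL && state != NULL);
  if (mbsinit (state))
    return 0;

  /* Converting the empty string emits only the shift sequence and the
     terminating NUL byte.  */
  n = wcsrtombs (outbuf, &srcp, outbuflen, state);
  if (n == (size_t) -1 || srcp != NULL)
    return (size_t) -1;   /* Conversion error or buffer too small.  */
  return n;
}
```

After a successful call @code{mbsinit} returns nonzero for the state
object, so the function can safely be called more than once.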
|
|
|
|
@node Converting a Character
|
|
@subsection Converting Single Characters
|
|
|
|
The most fundamental of the conversion functions are those dealing with
single characters. Please note that this does not always mean single
bytes. But since there is very often a subset of the multibyte
character set which consists of single byte sequences there are
functions to help with converting bytes. One very important and often
applicable scenario is where ASCII is a subset of the multibyte
character set. I.e., all ASCII characters stand for themselves and all
other characters have at least a first byte which is beyond the range
@math{0} to @math{127}.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun wint_t btowc (int @var{c})
|
|
The @code{btowc} function (``byte to wide character'') converts a valid
|
|
single byte character @var{c} in the initial shift state into the wide
|
|
character equivalent using the conversion rules from the currently
|
|
selected locale of the @code{LC_CTYPE} category.
|
|
|
|
If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
|
|
|
|
Please note the restriction of @var{c} being tested for validity only in
|
|
the initial shift state. There is no @code{mbstate_t} object used from
|
|
which the state information is taken and the function also does not use
|
|
any static state.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and
|
|
is declared in @file{wchar.h}.
|
|
@end deftypefun
|
|
|
|
Despite the limitation that the single byte value is always interpreted
in the initial state this function is actually useful most of the time.
Most character sets are either entirely single-byte character sets or
they are extensions of ASCII. It is then possible to write code like this
(not that this specific example is very useful):
|
|
|
|
@smallexample
|
|
wchar_t *
|
|
itow (unsigned long int val)
|
|
@{
|
|
static wchar_t buf[30];
|
|
wchar_t *wcp = &buf[29];
|
|
*wcp = L'\0';
|
|
while (val != 0)
|
|
@{
|
|
*--wcp = btowc ('0' + val % 10);
|
|
val /= 10;
|
|
@}
|
|
if (wcp == &buf[29])
|
|
*--wcp = L'0';
|
|
return wcp;
|
|
@}
|
|
@end smallexample
|
|
|
|
Why is it necessary to use such a complicated implementation and not
simply cast @code{'0' + val % 10} to a wide character? The answer is
that there is no guarantee that one can perform this kind of arithmetic
on the characters of the character set used for the @code{wchar_t}
representation. In other situations the bytes are not constant at
compile time and so the compiler cannot do the work. In situations like
this it is necessary to use @code{btowc}.
|
|
|
|
@noindent
|
|
There also is a function for the conversion in the other direction.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun int wctob (wint_t @var{c})
|
|
The @code{wctob} function (``wide character to byte'') takes as the
|
|
parameter a valid wide character. If the multibyte representation for
|
|
this character in the initial state is exactly one byte long the return
|
|
value of this function is this character. Otherwise the return value is
|
|
@code{EOF}.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and
|
|
is declared in @file{wchar.h}.
|
|
@end deftypefun
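
The two functions can be combined into a simple test for whether a byte
is a complete character of its own in the current locale. The
following is a minimal sketch; the name @code{is_single_byte_char} is
hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if the byte C is a complete character
   in the initial shift state and converting it to a wide character
   and back yields C again.  C must be EOF or representable as
   unsigned char.  */
int
is_single_byte_char (int c)
{
  wint_t wc;

  assert (c == EOF || (c >= 0 && c <= UCHAR_MAX));
  wc = btowc (c);
  if (wc == WEOF)
    return 0;            /* Not a valid single byte character, or EOF.  */
  return wctob (wc) == c;
}
```

In a locale whose character set is an extension of ASCII the function
returns nonzero for every ASCII character.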
|
|
|
|
There are more general functions to convert single characters from
multibyte representation to wide characters and vice versa. These
functions pose no limit on the length of the multibyte representation
and they also do not require it to be in the initial state.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
|
|
@cindex stateful
|
|
The @code{mbrtowc} function (``multibyte restartable to wide
|
|
character'') converts the next multibyte character in the string pointed
|
|
to by @var{s} into a wide character and stores it in the wide character
|
|
string pointed to by @var{pwc}. The conversion is performed according
|
|
to the locale currently selected for the @code{LC_CTYPE} category. If
|
|
the conversion for the character set used in the locale requires a state
|
|
the multibyte string is interpreted in the state represented by the
|
|
object pointed to by @var{ps}. If @var{ps} is a null pointer, a static,
|
|
internal state variable used only by the @code{mbrtowc} function is
|
|
used.
|
|
|
|
If the next multibyte character corresponds to the NUL wide character
the return value of the function is @math{0} and the state object is
afterwards in the initial state. If the next @var{n} or fewer bytes
form a correct multibyte character the return value is the number of
bytes starting from @var{s} which form the multibyte character. The
conversion state is updated according to the bytes consumed in the
conversion. In both cases the wide character (either the @code{L'\0'}
or the one found in the conversion) is stored in the string pointed to
by @var{pwc} if and only if @var{pwc} is not null.
|
|
|
|
If the first @var{n} bytes of the multibyte string possibly form a valid
multibyte character but more than @var{n} bytes are needed to
complete it the return value of the function is @code{(size_t) -2} and
no value is stored. Please note that this can happen even if @var{n}
has a value greater than or equal to @code{MB_CUR_MAX} since the input might
contain redundant shift sequences.

If the first @var{n} bytes of the multibyte string cannot possibly form
a valid multibyte character, no value is stored either, the global variable
@code{errno} is set to the value @code{EILSEQ} and the function returns
@code{(size_t) -1}. The conversion state is afterwards undefined.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and
|
|
is declared in @file{wchar.h}.
|
|
@end deftypefun
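
The four kinds of return value described above can be told apart as in
the following sketch. The enumeration and the function
@code{decode_one} are hypothetical names introduced only for this
illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical classification of the mbrtowc results.  */
enum decode_status
{
  DECODE_NUL,          /* The NUL wide character was converted.  */
  DECODE_CHAR,         /* A complete character was converted.  */
  DECODE_INCOMPLETE,   /* More than N bytes are needed.  */
  DECODE_INVALID       /* The bytes cannot form a valid character.  */
};

/* Hypothetical helper: convert the next character of the N bytes at S.
   On DECODE_CHAR, *USED is the number of bytes which form the
   character.  On DECODE_INCOMPLETE all N bytes have been consumed
   into *PS.  */
enum decode_status
decode_one (const char *s, size_t n, mbstate_t *ps,
            wchar_t *pwc, size_t *used)
{
  size_t r;

  assert (s != NULL && ps != NULL && used != NULL);
  r = mbrtowc (pwc, s, n, ps);
  if (r == (size_t) -1)
    {
      *used = 0;       /* errno is EILSEQ; *ps is undefined now.  */
      return DECODE_INVALID;
    }
  if (r == (size_t) -2)
    {
      *used = n;
      return DECODE_INCOMPLETE;
    }
  *used = r;
  return r == 0 ? DECODE_NUL : DECODE_CHAR;
}
```

The two special values must be tested before the result is used as a
byte count, since @code{size_t} is unsigned and both values are very
large numbers.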
|
|
|
|
Using this function is straightforward. A function which copies a
multibyte string into a wide character string while at the same time
converting all lowercase characters into uppercase could look like this
(this is not the final version, just an example; it treats incomplete
input the same way as invalid input):

@smallexample
wchar_t *
mbstouwcs (const char *s)
@{
  /* @r{Include the terminating NUL byte in the conversion.} */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp[1];
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
    @{
      if (nbytes >= (size_t) -2)
        @{
          /* @r{Invalid or incomplete input string.} */
          free (result);
          return NULL;
        @}
      *wcp++ = towupper (tmp[0]);
      len -= nbytes;
      s += nbytes;
    @}
  *wcp = L'\0';
  return result;
@}
@end smallexample
|
|
|
|
The use of @code{mbrtowc} should be clear. A single wide character is
|
|
stored in @code{@var{tmp}[0]} and the number of consumed bytes is stored
|
|
in the variable @var{nbytes}. If the conversion was successful
the uppercase variant of the wide character is stored in the
@var{result} array and the pointer to the input string and the number of
available bytes are adjusted.
|
|
|
|
The only non-obvious thing about the function might be the way memory is
|
|
allocated for the result. The above code uses the fact that there can
|
|
never be more wide characters in the converted results than there are
|
|
bytes in the multibyte input string. This method yields a
pessimistic guess about the size of the result and if many wide
character strings have to be constructed this way or the strings are
long, the extra memory allocated because the input string
|
|
contains multibyte characters might be significant. It would be
|
|
possible to resize the allocated memory block to the correct size before
|
|
returning it. A better solution might be to allocate just the right
|
|
amount of space for the result right away. Unfortunately there is no
|
|
function to compute the length of the wide character string directly
|
|
from the multibyte string. But there is a function which does part of
|
|
the work.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
|
|
The @code{mbrlen} function (``multibyte restartable length'') computes
the number of bytes, out of at most @var{n} bytes starting at @var{s},
which form the next valid and complete multibyte character.
|
|
|
|
If the next multibyte character corresponds to the NUL wide character
|
|
the return value is @math{0}. If the next @var{n} or fewer bytes form a
valid multibyte character, the number of bytes belonging to this
multibyte character is returned.
|
|
|
|
If the first @var{n} bytes possibly form a valid multibyte
character but it is incomplete, the return value is @code{(size_t) -2}.
|
|
Otherwise the multibyte character sequence is invalid and the return
|
|
value is @code{(size_t) -1}.
|
|
|
|
The multibyte sequence is interpreted in the state represented by the
|
|
object pointed to by @var{ps}. If @var{ps} is a null pointer, a state
|
|
object local to @code{mbrlen} is used.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and
|
|
is declared in @file{wchar.h}.
|
|
@end deftypefun
|
|
|
|
The attentive reader now will of course note that @code{mbrlen} can be
|
|
implemented as
|
|
|
|
@smallexample
|
|
mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
|
|
@end smallexample
|
|
|
|
This is true and in fact is mentioned in the official specification.
|
|
Now, how can this function be used to determine the length of the wide
|
|
character string created from a multibyte character string? It is not
|
|
directly usable but we can define a function @code{mbslen} using it:
|
|
|
|
@smallexample
size_t
mbslen (const char *s)
@{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    @{
      if (nbytes >= (size_t) -2)
        /* @r{Something is wrong.} */
        return (size_t) -1;
      s += nbytes;
      ++result;
    @}
  return result;
@}
@end smallexample
|
|
|
|
This function simply calls @code{mbrlen} for each multibyte character
|
|
in the string and counts the number of function calls. Please note that
|
|
we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
|
|
call. This is acceptable since a) this value is at least as large as the
length of the longest multibyte character sequence and b) we know that the
|
|
string @var{s} ends with a NUL byte which cannot be part of any other
|
|
multibyte character sequence but the one representing the NUL wide
|
|
character. Therefore the @code{mbrlen} function will never read invalid
|
|
memory.
|
|
|
|
Now that this function is available (just to make this clear, this
|
|
function is @emph{not} part of the GNU C library) we can compute the
|
|
number of wide characters required to store the converted multibyte
|
|
character string @var{s} using
|
|
|
|
@smallexample
|
|
wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
|
|
@end smallexample
|
|
|
|
Please note that the @code{mbslen} function is quite inefficient. An
implementation of @code{mbstouwcs} using @code{mbslen} would
|
|
have to perform the conversion of the multibyte character input string
|
|
twice and this conversion might be quite expensive. So it is necessary
|
|
to think about the consequences of using the easier but imprecise method
|
|
before doing the work twice.
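
To make the trade-off concrete, the two-pass approach could be sketched as
follows. The name @code{mbstouwcs_exact} is hypothetical, @code{mbslen} is
the helper defined above (repeated here so that the fragment is
self-contained), and the sketch assumes that both passes see the same
locale.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Count the multibyte characters in S, as in the mbslen example.  */
static size_t
mbslen (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}

/* Hypothetical two-pass variant of mbstouwcs: the first pass counts the
   characters, the second pass converts them into a block of exactly
   the required size.  */
wchar_t *
mbstouwcs_exact (const char *s)
{
  size_t nchars = mbslen (s);
  mbstate_t state;
  wchar_t *result;
  size_t i;

  if (nchars == (size_t) -1)
    return NULL;
  result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  memset (&state, '\0', sizeof (state));
  for (i = 0; i < nchars; ++i)
    {
      wchar_t wc;
      /* The first pass already validated the input.  */
      size_t nbytes = mbrtowc (&wc, s, MB_LEN_MAX, &state);
      result[i] = towupper (wc);
      s += nbytes;
    }
  result[nchars] = L'\0';
  return result;
}
```

Every multibyte character is decoded twice here, which is exactly the cost
discussed above; in exchange no memory is wasted.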
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
|
|
The @code{wcrtomb} function (``wide character restartable to
|
|
multibyte'') converts a single wide character into a multibyte string
|
|
corresponding to that wide character.
|
|
|
|
If @var{s} is a null pointer the function resets the state stored in
the object pointed to by @var{ps} (or the internal @code{mbstate_t}
|
|
object) to the initial state. This can also be achieved by a call like
|
|
this:
|
|
|
|
@smallexample
|
|
wcrtomb (temp_buf, L'\0', ps)
|
|
@end smallexample
|
|
|
|
@noindent
|
|
since if @var{s} is a null pointer @code{wcrtomb} performs as if it
|
|
writes into an internal buffer which is guaranteed to be large enough.
|
|
|
|
If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
necessary, a shift sequence to get the state @var{ps} into the initial
state, followed by a single NUL byte which is stored in the string @var{s}.
|
|
|
|
Otherwise a byte sequence (possibly including shift sequences) is
|
|
written into the string @var{s}. This only happens if @var{wc} is a
valid wide character, i.e., it has a multibyte representation in the
character set selected by the locale of the @code{LC_CTYPE} category. If
@var{wc} is not a valid wide character nothing is stored in the string
|
|
@var{s}, @code{errno} is set to @code{EILSEQ}, the conversion state in
|
|
@var{ps} is undefined and the return value is @code{(size_t) -1}.
|
|
|
|
If no error occurred the function returns the number of bytes stored in
|
|
the string @var{s}. This includes all bytes representing shift
|
|
sequences.
|
|
|
|
One word about the interface of the function: there is no parameter
|
|
specifying the length of the array @var{s}. Instead the function
|
|
assumes that there are at least @code{MB_CUR_MAX} bytes available since
|
|
this is the maximum length of any byte sequence representing a single
|
|
character. So the caller has to make sure that there is enough space
|
|
available, otherwise buffer overruns can occur.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
|
|
declared in @file{wchar.h}.
|
|
@end deftypefun
|
|
|
|
Using this function is as easy as using @code{mbrtowc}. The following
|
|
example appends a wide character string to a multibyte character string.
|
|
Again, the code is not really useful (or correct), it is simply here to
|
|
demonstrate the use and some problems.
|
|
|
|
@smallexample
char *
mbscatwcs (char *s, size_t len, const wchar_t *ws)
@{
  mbstate_t state;
  /* @r{Find the end of the existing string.} */
  char *wp = strchr (s, '\0');
  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    @{
      size_t nbytes;
      if (len < MB_CUR_MAX)
        @{
          /* @r{We cannot guarantee that the next}
             @r{character fits into the buffer, so}
             @r{return an error.} */
          errno = E2BIG;
          return NULL;
        @}
      nbytes = wcrtomb (wp, *ws, &state);
      if (nbytes == (size_t) -1)
        /* @r{Error in the conversion.} */
        return NULL;
      len -= nbytes;
      wp += nbytes;
    @}
  while (*ws++ != L'\0');
  return s;
@}
@end smallexample
|
|
|
|
First the function has to find the end of the string currently in the
|
|
array @var{s}. The @code{strchr} call does this very efficiently since a
|
|
requirement for multibyte character representations is that the NUL byte
|
|
never is used except to represent itself (and in this context, the end
|
|
of the string).
|
|
|
|
After initializing the state object the loop is entered where the first
|
|
task is to make sure there is enough room in the array @var{s}. We
|
|
abort if there are not at least @code{MB_CUR_MAX} bytes available. This
is not always optimal but we have no other choice. We might have fewer
than @code{MB_CUR_MAX} bytes available but the next multibyte character
|
|
might also be only one byte long. At the time the @code{wcrtomb} call
|
|
returns it is too late to decide whether the buffer was large enough or
|
|
not. If this solution is really unsuitable there is a very slow but
|
|
more accurate solution.
|
|
|
|
@smallexample
  ...
  if (len < MB_CUR_MAX)
    @{
      mbstate_t temp_state;
      char temp_buf[MB_LEN_MAX];
      memcpy (&temp_state, &state, sizeof (state));
      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
        @{
          /* @r{We cannot guarantee that the next}
             @r{character fits into the buffer, so}
             @r{return an error.} */
          errno = E2BIG;
          return NULL;
        @}
    @}
  ...
@end smallexample

Here we perform the conversion into a scratch buffer, which cannot
overflow, so that we are afterwards in the position to make an exact
decision about the buffer size. Please note that the scratch buffer
@var{temp_buf}, and not a null pointer, is passed as the destination in
the new @code{wcrtomb} call: with a null destination @code{wcrtomb}
ignores @var{wc} and only resets the state, so it could not tell us the
length of the next character. The most unusual thing about this piece
of code certainly is the duplication of the conversion state object.
But think about this: if a change of the state is necessary to emit the
next multibyte character we want to have the same shift state change
performed in the real conversion. Therefore we have to preserve the
initial shift state information.
|
|
|
|
There are certainly many more and even better solutions to this problem.
|
|
This example is only meant for educational purposes.
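
One such alternative, sketched here under the assumption that an extra
copy per character is acceptable, converts every wide character into a
scratch buffer first and copies it into the destination only if it fits.
No duplicate of the state object is needed because each character is
converted exactly once. The name @code{mbscatwcs_exact} is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical variant of mbscatwcs.  S is a buffer of LEN bytes that
   holds a NUL-terminated multibyte string; WS is appended to it.  */
char *
mbscatwcs_exact (char *s, size_t len, const wchar_t *ws)
{
  mbstate_t state;
  /* Remember the old end so that it can be restored on failure.  */
  char *const end = strchr (s, '\0');
  char *wp = end;

  /* Bytes available from the old terminating NUL byte onwards.  */
  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    {
      char tmp[MB_LEN_MAX];
      size_t nbytes = wcrtomb (tmp, *ws, &state);

      if (nbytes == (size_t) -1)
        {
          /* No multibyte representation; errno is EILSEQ.  */
          *end = '\0';
          return NULL;
        }
      if (nbytes > len)
        {
          /* The character does not fit; undo the partial append.  */
          *end = '\0';
          errno = E2BIG;
          return NULL;
        }
      memcpy (wp, tmp, nbytes);
      wp += nbytes;
      len -= nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
```

On failure the original string is left intact, although the bytes after
its terminating NUL byte may have been overwritten.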
|
|
|
|
@node Converting Strings
|
|
@subsection Converting Multibyte and Wide Character Strings
|
|
|
|
The functions described in the previous section only convert a single
|
|
character at a time. Most operations to be performed in real-world
programs involve strings and therefore the @w{ISO C} standard also
|
|
defines conversions on entire strings. However, the defined set of
|
|
functions is quite limited, thus the GNU C library contains a few
|
|
extensions which can help in some important situations.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
|
|
The @code{mbsrtowcs} function (``multibyte string restartable to wide
|
|
character string'') converts a NUL-terminated multibyte character
string at @code{*@var{src}} into an equivalent wide character string,
including the NUL wide character at the end. The conversion is started
using the state information from the object pointed to by @var{ps} or
from an internal object of @code{mbsrtowcs} if @var{ps} is a null
pointer. Before returning, the state object is updated to match the state after the
|
|
last converted character. The state is the initial state if the
|
|
terminating NUL byte is reached and converted.
|
|
|
|
If @var{dst} is not a null pointer the result is stored in the array
|
|
pointed to by @var{dst}, otherwise the conversion result is not
|
|
available since it is stored in an internal buffer.
|
|
|
|
If @var{len} wide characters are stored in the array @var{dst} before
|
|
reaching the end of the input string the conversion stops and @var{len}
|
|
is returned. If @var{dst} is a null pointer @var{len} is never checked.
|
|
|
|
Another reason for a premature return from the function call is if the
|
|
input string contains an invalid multibyte sequence. In this case the
|
|
global variable @code{errno} is set to @code{EILSEQ} and the function
|
|
returns @code{(size_t) -1}.
|
|
|
|
@c XXX The ISO C9x draft seems to have a problem here. It says that PS
|
|
@c is not updated if DST is NULL. This is not said straight forward and
|
|
@c none of the other functions is described like this. It would make sense
|
|
@c to define the function this way but I don't think it is meant like this.
|
|
|
|
In all other cases the function returns the number of wide characters
converted during this call, not counting the terminating NUL wide
character. If @var{dst} is not null @code{mbsrtowcs}
stores in the pointer pointed to by @var{src} either a null pointer (if the NUL
byte in the input string was reached) or the address of the byte
following the last converted multibyte character.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
|
|
declared in @file{wchar.h}.
|
|
@end deftypefun
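
Because @var{len} is not checked when @var{dst} is a null pointer, a first
call can be used merely to count the wide characters and a second call to
convert them. The following is a sketch of this idiom; the helper name
@code{mbs_to_wcs_dup} is hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string S into a newly
   allocated wide character string of exactly the required size.  */
wchar_t *
mbs_to_wcs_dup (const char *s)
{
  mbstate_t state;
  const char *src = s;
  wchar_t *result;
  size_t nchars;

  /* First pass: DST is null, so nothing is stored and LEN is ignored;
     only the number of wide characters is returned.  */
  memset (&state, '\0', sizeof (state));
  nchars = mbsrtowcs (NULL, &src, 0, &state);
  if (nchars == (size_t) -1)
    return NULL;

  result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, including the terminating NUL.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, nchars + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

As with any two-pass scheme the input is decoded twice, so this trades
time for an exact allocation.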
|
|
|
|
The definition of the @code{mbsrtowcs} function has one important
limitation. The requirement that the source string @code{*@var{src}}
has to be NUL terminated
causes problems if one wants to convert buffers with text. A
buffer is normally not a collection of NUL-terminated strings but instead a
continuous collection of lines, separated by newline characters. Now
assume that a function to convert one line from a buffer is needed. Since
the line is not NUL terminated, the source pointer cannot directly point
into the unmodified text buffer. This means that either one inserts the NUL
byte at the appropriate place for the time of the @code{mbsrtowcs}
function call (which is not doable for a read-only buffer or in a
multi-threaded application) or one copies the line into an extra buffer
where it can be terminated by a NUL byte. Note that it is not in
general possible to limit the number of characters to convert by setting
the parameter @var{len} to any specific value. Since it is not known
how many bytes each multibyte character sequence is in length, one
can only guess.
|
|
|
|
@cindex stateful
|
|
There is still a problem with the method of NUL-terminating a line right
after the newline character which could lead to very strange results.
As said in the description of the @code{mbsrtowcs} function above the
conversion state is guaranteed to be in the initial shift state after
processing the NUL byte at the end of the input string. But this NUL
byte is not really part of the text. That is, the conversion state after
the newline in the original text could be something different than the
initial shift state and therefore the first character of the next line
is encoded using this state. But the state in question is never
accessible to the user since the conversion stops after the NUL byte
(which resets the state). Most stateful character sets in use today
require that the shift state after a newline is the initial state--but
this is not a strict guarantee. Therefore simply NUL-terminating a
piece of a running text is not always an adequate solution and
should never be used in general-purpose code.
|
|
|
|
The generic conversion interface (@pxref{Generic Charset Conversion})
|
|
does not have this limitation (it simply works on buffers, not
|
|
strings), and the GNU C library contains a set of functions which take
|
|
additional parameters specifying the maximal number of bytes which are
|
|
consumed from the input string. This way the problem described for
@code{mbsrtowcs} above can be solved by determining the line
length and passing this length to the function.
|
|
|
|
@comment wchar.h
|
|
@comment ISO
|
|
@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
|
|
The @code{wcsrtombs} function (``wide character string restartable to
|
|
multibyte string'') converts the NUL terminated wide character string at
|
|
@code{*@var{src}} into an equivalent multibyte character string and
|
|
stores the result in the array pointed to by @var{dst}. The NUL wide
|
|
character is also converted. The conversion starts in the state
|
|
described in the object pointed to by @var{ps} or by a state object
|
|
local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
|
|
@var{dst} is a null pointer the conversion is performed as usual but the
|
|
result is not available. If all characters of the input string were
|
|
successfully converted and if @var{dst} is not a null pointer the
|
|
pointer pointed to by @var{src} gets assigned a null pointer.
|
|
|
|
If one of the wide characters in the input string has no valid multibyte
|
|
character equivalent the conversion stops early, sets the global
|
|
variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
|
|
|
|
Another reason for a premature stop is if @var{dst} is not a null
pointer and the next converted character would require more than
@var{len} bytes in total to be stored in the array @var{dst}. In this
case the pointer pointed to by @var{src} is
assigned a value pointing to the wide character right after the last one
successfully converted.
|
|
|
|
Except in the case of an encoding error the return value of the function
|
|
is the number of bytes in all the multibyte character sequences stored
|
|
in @var{dst}. Before returning the state in the object pointed to by
|
|
@var{ps} (or the internal object in case @var{ps} is a null pointer) is
|
|
updated to reflect the state after the last conversion. The state is
|
|
the initial shift state in case the terminating NUL wide character was
|
|
converted.
|
|
|
|
@pindex wchar.h
|
|
This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
|
|
declared in @file{wchar.h}.
|
|
@end deftypefun
|
|
|
|
The restriction mentioned above for the @code{mbsrtowcs} function applies
here also. There is no possibility of directly controlling the number of
|
|
input characters. One has to place the NUL wide character at the
|
|
correct place or control the consumed input indirectly via the available
|
|
output array size (the @var{len} parameter).
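
Controlling the consumed input through @var{len} could look like the
following sketch, which converts a wide character string piece by piece
through a small intermediate buffer. The helper name
@code{wcs_to_mbs_chunked} and the chunk size are assumptions made for
this illustration.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert WS piecewise through a small buffer and
   append each piece to OUT, which has room for OUTSIZE bytes.
   Returns 0 on success and -1 on error.  */
int
wcs_to_mbs_chunked (char *out, size_t outsize, const wchar_t *ws)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t used = 0;

  memset (&state, '\0', sizeof (state));
  /* wcsrtombs sets SRC to a null pointer once the terminating NUL
     wide character has been converted.  */
  while (src != NULL)
    {
      char chunk[MB_LEN_MAX];
      size_t n = wcsrtombs (chunk, &src, sizeof (chunk), &state);

      if (n == (size_t) -1)
        /* A wide character has no multibyte representation.  */
        return -1;
      if (n == 0 && src != NULL)
        /* No progress was made; give up instead of looping forever.  */
        return -1;
      if (used + n >= outsize)
        /* Not enough room for the piece and the final NUL byte.  */
        return -1;
      memcpy (out + used, chunk, n);
      used += n;
    }
  out[used] = '\0';
  return 0;
}
```

The same state object is passed to every call, so a shift state that is
active at the end of one piece is still in effect for the next one.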
|
|
|
|
@comment wchar.h
|
|
@comment GNU
|
|
@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
|
|
The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
|
|
function. All the parameters are the same except for @var{nmc} which is
|
|
new. The return value is the same as for @code{mbsrtowcs}.
|
|
|
|
This new parameter specifies how many bytes at most can be used from the
|
|
multibyte character string. I.e., the multibyte character string
|
|
@code{*@var{src}} need not be NUL terminated. But if a NUL byte is
|
|
found within the @var{nmc} first bytes of the string the conversion
|
|
stops here.
|
|
|
|
This function is a GNU extension. It is meant to work around the
problems mentioned above. Now it is possible to convert a buffer with
multibyte character text piece by piece without having to care about
|
|
inserting NUL bytes and the effect of NUL bytes on the conversion state.
|
|
@end deftypefun
|
|
|
|
A function to convert a multibyte string into a wide character string
|
|
and display it could be written like this (this is not a really useful
|
|
example):
|
|
|
|
@smallexample
void
showmbs (const char *src, FILE *fp)
@{
  mbstate_t state;
  int cnt = 0;
  memset (&state, '\0', sizeof (state));
  while (1)
    @{
      wchar_t linebuf[100];
      const char *endp = strchr (src, '\n');
      size_t n;

      /* @r{Exit if there is no more line.} */
      if (endp == NULL)
        break;

      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
      /* @r{Exit if the line contains an invalid sequence.} */
      if (n == (size_t) -1)
        break;
      linebuf[n] = L'\0';
      fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);
      /* @r{Continue after the newline.} */
      src = endp + 1;
    @}
@}
@end smallexample
|
|
|
|
There is no problem with the state after a call to @code{mbsnrtowcs}.
|
|
Since we don't insert characters in the strings which were not in there
|
|
right from the beginning and we use @var{state} only for the conversion
|
|
of the given buffer there is no problem with altering the state.
|
|
|
|
@comment wchar.h
|
|
@comment GNU
|
|
@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
|
|
The @code{wcsnrtombs} function implements the conversion from wide
|
|
character strings to multibyte character strings. It is similar to
|
|
@code{wcsrtombs} but it takes, just like @code{mbsnrtowcs}, an extra
|
|
parameter which specifies the length of the input string.
|
|
|
|
No more than @var{nwc} wide characters from the input string
|
|
@code{*@var{src}} are converted. If the input string contains a NUL
|
|
wide character in the first @var{nwc} characters the conversion stops at
|
|
this place.
|
|
|
|
This function is a GNU extension and just like @code{mbsnrtowcs}
helps in situations where no NUL-terminated input strings are available.
|
|
@end deftypefun
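
A short sketch of how @code{wcsnrtombs} might be applied to a slice of a
wide character array that is not NUL terminated follows. The helper name
@code{wcs_slice_to_mbs} is hypothetical, and defining @code{_GNU_SOURCE}
is assumed to be the way the declaration is made visible.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE   /* Request the declaration of wcsnrtombs.  */
#endif
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NWC wide characters at WS, which
   need not be NUL terminated, into OUT (OUTSIZE bytes) and terminate
   the result.  Returns the number of bytes stored before the NUL
   byte, or (size_t) -1 on error.  */
size_t
wcs_slice_to_mbs (char *out, size_t outsize, const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t n;

  if (outsize == 0)
    return (size_t) -1;
  memset (&state, '\0', sizeof (state));
  /* Reserve one byte for the terminating NUL byte, which is only
     produced by wcsnrtombs if the slice itself contains a NUL wide
     character.  */
  n = wcsnrtombs (out, &src, nwc, outsize - 1, &state);
  if (n == (size_t) -1)
    return n;
  out[n] = '\0';
  return n;
}
```

If the output buffer is too small the result is silently truncated at a
character boundary; a real program would probably compare @var{src} with
the end of the slice to detect this.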
|
|
|
|
|
|
@node Multibyte Conversion Example
|
|
@subsection A Complete Multibyte Conversion Example
|
|
|
|
The example programs given in the last sections are only brief and do
|
|
not contain all the error checking etc. Presented here is a complete
|
|
and documented example. It features the @code{mbrtowc} function but it
|
|
should be easy to derive versions using the other functions.
|
|
|
|
@smallexample
int
file_mbsrtowcs (int input, int output)
@{
  /* @r{Note the use of @code{MB_LEN_MAX}.}
     @r{@code{MB_CUR_MAX} cannot portably be used here.} */
  char buffer[BUFSIZ + MB_LEN_MAX];
  mbstate_t state;
  int filled = 0;
  int eof = 0;

  /* @r{Initialize the state.} */
  memset (&state, '\0', sizeof (state));

  while (!eof)
    @{
      ssize_t nread;
      ssize_t nwrite;
      char *inp = buffer;
      wchar_t outbuf[BUFSIZ];
      wchar_t *outp = outbuf;

      /* @r{Fill up the buffer from the input file.} */
      nread = read (input, buffer + filled, BUFSIZ);
      if (nread < 0)
        @{
          perror ("read");
          return 0;
        @}
      /* @r{If we reach end of file, make a note to read no more.} */
      if (nread == 0)
        eof = 1;

      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
      filled += nread;

      /* @r{Convert those bytes to wide characters--as many as we can.} */
      while (1)
        @{
          mbstate_t saved = state;
          size_t thislen = mbrtowc (outp, inp, filled, &state);
          /* @r{Stop converting at an invalid or incomplete character;}
             @r{this can mean we have read just the first part}
             @r{of a valid character.  Restore the state so that}
             @r{the remaining bytes can be converted again later.} */
          if (thislen >= (size_t) -2)
            @{
              state = saved;
              break;
            @}
          /* @r{We want to handle embedded NUL bytes}
             @r{but the return value is 0.  Correct this.} */
          if (thislen == 0)
            thislen = 1;
          /* @r{Advance past this character.} */
          inp += thislen;
          filled -= thislen;
          ++outp;
        @}

      /* @r{Write the wide characters we just made.} */
      nwrite = write (output, outbuf,
                      (outp - outbuf) * sizeof (wchar_t));
      if (nwrite < 0)
        @{
          perror ("write");
          return 0;
        @}

      /* @r{See if we have a @emph{real} invalid character.} */
      if ((eof && filled > 0) || filled >= MB_CUR_MAX)
        @{
          error (0, 0, "invalid multibyte character");
          return 0;
        @}

      /* @r{If any characters must be carried forward,}
         @r{put them at the beginning of @code{buffer}.} */
      if (filled > 0)
        memmove (buffer, inp, filled);
    @}

  return 1;
@}
@end smallexample
|
|
|
|
|
|
@node Non-reentrant Conversion
|
|
@section Non-reentrant Conversion Function
|
|
|
|
The functions described in the previous section are defined in
@w{Amendment 1} to @w{ISO C90}. But the original @w{ISO C90} standard also
contained functions for character set conversion. The reason that they
are not described first is that they are almost entirely
useless.
|
|
|
|
The problem is that all the functions for conversion defined in @w{ISO
|
|
C90} use a local state. This implies that multiple conversions at the
|
|
same time (not only when using threads) cannot be done, and that you
|
|
cannot first convert single characters and then strings since you cannot
|
|
tell the conversion functions which state to use.
|
|
|
|
These functions are therefore usable only in a very limited set of
|
|
situations. One must complete converting the entire string before
|
|
starting a new one and each string/text must be converted with the same
|
|
function (there is no problem with the library itself; it is guaranteed
|
|
that no library function changes the state of any of these functions).
|
|
@strong{For the above reasons it is strongly recommended that the functions
from the previous section are used in place of the non-reentrant conversion
functions.}
|
|
|
|
@menu
|
|
* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single
|
|
Characters.
|
|
* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings.
|
|
* Shift State:: States in Non-reentrant Functions.
|
|
@end menu
|
|
|
|
@node Non-reentrant Character Conversion
|
|
@subsection Non-reentrant Conversion of Single Characters
|
|
|
|
@comment stdlib.h
|
|
@comment ISO
|
|
@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})
|
|
The @code{mbtowc} (``multibyte to wide character'') function when called
|
|
with non-null @var{string} converts the first multibyte character
|
|
beginning at @var{string} to its corresponding wide character code. It
|
|
stores the result in @code{*@var{result}}.
|
|
|
|
@code{mbtowc} never examines more than @var{size} bytes. (The idea is
|
|
to supply for @var{size} the number of bytes of data you have in hand.)
|
|
|
|
@code{mbtowc} with non-null @var{string} distinguishes three
|
|
possibilities: the first @var{size} bytes at @var{string} start with
a valid multibyte character, they start with an invalid byte sequence or
|
|
just part of a character, or @var{string} points to an empty string (a
|
|
null character).
|
|
|
|
For a valid multibyte character, @code{mbtowc} converts it to a wide
|
|
character and stores that in @code{*@var{result}}, and returns the
|
|
number of bytes in that character (always at least @math{1}, and never
|
|
more than @var{size}).
|
|
|
|
For an invalid byte sequence, @code{mbtowc} returns @math{-1}. For an
empty string, it returns @math{0}, also storing @code{L'\0'} in
@code{*@var{result}}.
|
|
|
|
If the multibyte character code uses shift characters, then
|
|
@code{mbtowc} maintains and updates a shift state as it scans. If you
|
|
call @code{mbtowc} with a null pointer for @var{string}, that
|
|
initializes the shift state to its standard initial value. It also
|
|
returns nonzero if the multibyte character code in use actually has a
|
|
shift state. @xref{Shift State}.
|
|
@end deftypefun
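
A loop built on @code{mbtowc} could be sketched like this. The helper name
@code{decode_with_mbtowc} is hypothetical, and since the shift state is
hidden inside the library the sketch resets it before it starts.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode at most MAX wide characters from the
   multibyte string S into OUT using the non-reentrant mbtowc.
   Returns the number of characters stored, or -1 on invalid input.  */
int
decode_with_mbtowc (wchar_t *out, int max, const char *s)
{
  size_t left = strlen (s);
  int count = 0;

  /* Reset the hidden shift state before starting a new string.  */
  (void) mbtowc (NULL, NULL, 0);
  while (left > 0 && count < max)
    {
      int n = mbtowc (&out[count], s, left);
      if (n <= 0)
        /* Invalid or truncated sequence.  */
        return -1;
      s += n;
      left -= n;
      ++count;
    }
  return count;
}
```

No other conversion with @code{mbtowc} may be interleaved with this loop,
because there is only one hidden state.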
|
|
|
|
@comment stdlib.h
|
|
@comment ISO
|
|
@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
|
|
The @code{wctomb} (``wide character to multibyte'') function converts
|
|
the wide character code @var{wchar} to its corresponding multibyte
|
|
character sequence, and stores the result in bytes starting at
|
|
@var{string}. At most @code{MB_CUR_MAX} bytes are stored.
|
|
|
|
@code{wctomb} with non-null @var{string} distinguishes three
|
|
possibilities for @var{wchar}: a valid wide character code (one that can
|
|
be translated to a multibyte character), an invalid code, and @code{L'\0'}.
|
|
|
|
Given a valid code, @code{wctomb} converts it to a multibyte character,
|
|
storing the bytes starting at @var{string}. Then it returns the number
|
|
of bytes in that character (always at least @math{1}, and never more
|
|
than @code{MB_CUR_MAX}).
|
|
|
|
If @var{wchar} is an invalid wide character code, @code{wctomb} returns
@math{-1}. If @var{wchar} is @code{L'\0'}, it stores @code{'\0'} in
@code{*@var{string}} (after any shift sequence needed to return to the
initial state) and returns the number of bytes stored, which is @math{1}
when no shift sequence is needed.

If the multibyte character code uses shift characters, then
@code{wctomb} maintains and updates a shift state as it scans. If you
call @code{wctomb} with a null pointer for @var{string}, that
initializes the shift state to its standard initial value. It also
returns nonzero if the multibyte character code in use actually has a
shift state. @xref{Shift State}.

Calling this function with a @var{wchar} argument of zero when
@var{string} is not null has the side-effect of reinitializing the
stored shift state @emph{as well as} storing the multibyte character
@code{'\0'}.
|
|
@end deftypefun
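
Since @code{wctomb} has no parameter for the size of the destination, a
cautious caller might convert into a scratch buffer of @code{MB_LEN_MAX}
bytes first, as in this sketch. The helper name
@code{encode_with_wctomb} is hypothetical.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store the multibyte form of WC, followed by a
   NUL byte, in OUT (OUTSIZE bytes).  Returns the number of bytes
   stored before the NUL byte, or -1 if WC is invalid or does not fit.  */
int
encode_with_wctomb (char *out, size_t outsize, wchar_t wc)
{
  char tmp[MB_LEN_MAX];
  int n;

  /* Reset the hidden shift state.  */
  (void) wctomb (NULL, L'\0');
  n = wctomb (tmp, wc);
  if (n < 0 || (size_t) n >= outsize)
    return -1;
  memcpy (out, tmp, n);
  out[n] = '\0';
  return n;
}
```

@code{MB_LEN_MAX} is a compile-time constant, which makes it usable as an
array size where @code{MB_CUR_MAX} is not.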
|
|
|
|
Similar to @code{mbrlen} there is also a non-reentrant function which
|
|
computes the length of a multibyte character. It can be defined in
|
|
terms of @code{mbtowc}.
|
|
|
|
@comment stdlib.h
|
|
@comment ISO
|
|
@deftypefun int mblen (const char *@var{string}, size_t @var{size})
|
|
The @code{mblen} function with a non-null @var{string} argument returns
|
|
the number of bytes that make up the multibyte character beginning at
|
|
@var{string}, never examining more than @var{size} bytes. (The idea is
|
|
to supply for @var{size} the number of bytes of data you have in hand.)
|
|
|
|
The return value of @code{mblen} distinguishes three possibilities: the
|
|
first @var{size} bytes at @var{string} start with a valid multibyte
character, they start with an invalid byte sequence or just part of a
|
|
character, or @var{string} points to an empty string (a null character).
|
|
|
|
For a valid multibyte character, @code{mblen} returns the number of
|
|
bytes in that character (always at least @code{1}, and never more than
|
|
@var{size}). For an invalid byte sequence, @code{mblen} returns
|
|
@math{-1}. For an empty string, it returns @math{0}.
|
|
|
|
If the multibyte character code uses shift characters, then @code{mblen}
|
|
maintains and updates a shift state as it scans. If you call
|
|
@code{mblen} with a null pointer for @var{string}, that initializes the
|
|
shift state to its standard initial value. It also returns a nonzero
|
|
value if the multibyte character code in use actually has a shift state.
|
|
@xref{Shift State}.
|
|
|
|
@pindex stdlib.h
|
|
The function @code{mblen} is declared in @file{stdlib.h}.
|
|
@end deftypefun
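
The remark that @code{mblen} can be defined in terms of @code{mbtowc}
could be sketched as follows. The name @code{my_mblen} is hypothetical,
and unlike the real @code{mblen} this stand-in shares the hidden shift
state of @code{mbtowc}.

```c
#include <stdlib.h>

/* Hypothetical stand-in for mblen, written in terms of mbtowc.  With a
   null result pointer mbtowc stores nothing and only reports the
   length; with a null STRING it resets the shift state.  */
int
my_mblen (const char *string, size_t size)
{
  return mbtowc (NULL, string, size);
}
```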
|
|
|
|
|
|
@node Non-reentrant String Conversion
|
|
@subsection Non-reentrant Conversion of Strings
|
|
|
|
For convenience the @w{ISO C90} standard also defines functions
|
|
to convert entire strings instead of single characters. These functions
|
|
suffer from the same problems as their reentrant counterparts from
|
|
@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
|
|
|
|
@comment stdlib.h
|
|
@comment ISO
|
|
@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
|
|
The @code{mbstowcs} (``multibyte string to wide character string'')
|
|
function converts the null-terminated string of multibyte characters
|
|
@var{string} to an array of wide character codes, storing not more than
|
|
@var{size} wide characters into the array beginning at @var{wstring}.
|
|
The terminating null character counts towards the size, so if @var{size}
|
|
is less than the actual number of wide characters resulting from
|
|
@var{string}, no terminating null character is stored.
|
|
|
|
The conversion of characters from @var{string} begins in the initial
|
|
shift state.
|
|
|
|
If an invalid multibyte character sequence is found, this function
|
|
returns a value of @math{-1}. Otherwise, it returns the number of wide
|
|
characters stored in the array @var{wstring}. This number does not
|
|
include the terminating null character, which is present if the number
|
|
is less than @var{size}.
|
|
|
|
Here is an example showing how to convert a string of multibyte
|
|
characters, allocating enough space for the result.
|
|
|
|
@smallexample
wchar_t *
mbstowcs_alloc (const char *string)
@{
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));

  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    @{
      /* @r{Invalid input: release the buffer instead of leaking it.} */
      free (buf);
      return NULL;
    @}
  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
  return buf;
@}
@end smallexample
|
|
|
|
@end deftypefun
|
|
|
|
@comment stdlib.h
|
|
@comment ISO
|
|
@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
|
|
The @code{wcstombs} (``wide character string to multibyte string'')
|
|
function converts the null-terminated wide character array @var{wstring}
|
|
into a string containing multibyte characters, storing not more than
|
|
@var{size} bytes starting at @var{string}, followed by a terminating
|
|
null character if there is room. The conversion of characters begins in
|
|
the initial shift state.
|
|
|
|
The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of bytes needed to represent
@var{wstring}, no terminating null character is stored.
|
|
|
|
If a code that does not correspond to a valid multibyte character is
|
|
found, this function returns a value of @math{-1}. Otherwise, the
|
|
return value is the number of bytes stored in the array @var{string}.
|
|
This number does not include the terminating null character, which is
|
|
present if the number is less than @var{size}.
|
|
@end deftypefun
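As a counterpart to the @code{mbstowcs} example, the following sketch
converts a wide character string into a newly allocated multibyte
string. The name @code{wcstombs_alloc} is hypothetical, plain
@code{malloc} is used instead of @code{xmalloc} to keep the block
self-contained, and the buffer size rests on the assumption that no wide
character needs more than @code{MB_CUR_MAX} bytes.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert WSTRING to a freshly allocated multibyte
   string in the encoding of the current LC_CTYPE locale.  Returns NULL
   if memory is exhausted or a wide character cannot be converted.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  /* Assumption: at most MB_CUR_MAX bytes per wide character, plus one
     byte for the terminating null character.  */
  size_t size = wcslen (wstring) * MB_CUR_MAX + 1;
  char *buf = malloc (size);
  if (buf == NULL)
    return NULL;

  size_t len = wcstombs (buf, wstring, size);
  if (len == (size_t) -1)
    {
      free (buf);
      return NULL;
    }

  /* LEN is smaller than SIZE here, so the result is null-terminated.
     Shrink the buffer; if that fails the larger buffer is still valid.  */
  char *shrunk = realloc (buf, len + 1);
  return shrunk != NULL ? shrunk : buf;
}
```

The caller releases the result with @code{free}. Because the buffer is
sized for the worst case first, a single @code{wcstombs} call is enough
and the return value never reaches @var{size}.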
|
|
|
|
@node Shift State
|
|
@subsection States in Non-reentrant Functions
|
|
|
|
In some multibyte character codes, the @emph{meaning} of any particular
|
|
byte sequence is not fixed; it depends on what other sequences have come
|
|
earlier in the same string. Typically there are just a few sequences
|
|
that can change the meaning of other sequences; these few are called
|
|
@dfn{shift sequences} and we say that they set the @dfn{shift state} for
|
|
other sequences that follow.
|
|
|
|
To illustrate shift state and shift sequences, suppose we decide that
|
|
the sequence @code{0200} (just one byte) enters Japanese mode, in which
|
|
pairs of bytes in the range from @code{0240} to @code{0377} are single
|
|
characters, while @code{0201} enters Latin-1 mode, in which single bytes
|
|
in the range from @code{0240} to @code{0377} are characters, and
|
|
interpreted according to the ISO Latin-1 character set. This is a
|
|
multibyte code which has two alternative shift states (``Japanese mode''
|
|
and ``Latin-1 mode''), and two shift sequences that specify particular
|
|
shift states.
|
|
|
|
When the multibyte character code in use has shift states, then
|
|
@code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
|
|
the current shift state as they scan the string. To make this work
|
|
properly, you must follow these rules:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Before starting to scan a string, call the function with a null pointer
|
|
for the multibyte character address---for example, @code{mblen (NULL,
|
|
0)}. This initializes the shift state to its standard initial value.
|
|
|
|
@item
|
|
Scan the string one character at a time, in order. Do not ``back up''
|
|
and rescan characters already scanned, and do not intersperse the
|
|
processing of different strings.
|
|
@end itemize
|
|
|
|
Here is an example of using @code{mblen} following these rules:
|
|
|
|
@smallexample
void
scan_string (char *s)
@{
  int length = strlen (s);

  /* @r{Initialize shift state.} */
  mblen (NULL, 0);

  while (1)
    @{
      int thischar = mblen (s, length);
      /* @r{Deal with end of string and invalid characters.} */
      if (thischar == 0)
        break;
      if (thischar == -1)
        @{
          error ("invalid multibyte character");
          break;
        @}
      /* @r{Advance past this character.} */
      s += thischar;
      length -= thischar;
    @}
@}
@end smallexample
|
|
|
|
The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
|
|
reentrant when using a multibyte code that uses a shift state. However,
|
|
no other library functions call these functions, so you don't have to
|
|
worry that the shift state will be changed mysteriously.
|
|
|
|
|
|
@node Generic Charset Conversion
|
|
@section Generic Charset Conversion
|
|
|
|
The conversion functions mentioned so far in this chapter all had in
|
|
common that they operate on character sets which are not directly
|
|
specified by the functions. The multibyte encoding used is specified by
|
|
the currently selected locale for the @code{LC_CTYPE} category. The
|
|
wide character set is fixed by the implementation (in the case of the GNU C
library it is always UCS-4 encoded @w{ISO 10646}).
|
|
|
|
This of course causes several problems when it comes to general character
conversion:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
For every conversion where neither the source nor the destination character
set is the character set of the locale for the @code{LC_CTYPE} category,
one has to change the @code{LC_CTYPE} locale using @code{setlocale}.

This introduces major problems for the rest of the program since
several more functions (e.g., the character classification functions,
@pxref{Classification of Characters}) use the @code{LC_CTYPE} category.
|
|
|
|
@item
|
|
Parallel conversions to and from different character sets are not
|
|
possible since the @code{LC_CTYPE} selection is global and shared by all
|
|
threads.
|
|
|
|
@item
|
|
If neither the source nor the destination character set is the character
|
|
set used for @code{wchar_t} representation there is at least a two-step
|
|
process necessary to convert a text using the functions above. One
|
|
would have to select the source character set as the multibyte encoding,
|
|
convert the text into a @code{wchar_t} text, select the destination
|
|
character set as the multibyte encoding and convert the wide character
|
|
text to the multibyte (@math{=} destination) character set.
|
|
|
|
Even if this is possible (which is not guaranteed) it is very tedious
work. It also suffers even more from the other two problems because the
locale has to be changed constantly.
|
|
@end itemize
|
|
|
|
|
|
The XPG2 standard defines a completely new set of functions which has
none of these limitations. They are not at all coupled to the selected
locales and they put no constraints on the character sets selected for
source and destination. They are limited only by the set of available
conversions. The standard does not specify that any conversion at all
must be available; which conversions exist is a measure of the quality
of the implementation.
|
|
|
|
In the following text the interface to @code{iconv}, the
conversion function, is described first. Comparisons with other
implementations then show what pitfalls lie in the way of portable
applications. Finally, the implementation is described as far as it is
of interest to the advanced user who wants to extend the conversion
capabilities.
|
|
|
|
@menu
|
|
* Generic Conversion Interface:: Generic Character Set Conversion Interface.
|
|
* iconv Examples:: A complete @code{iconv} example.
|
|
* Other iconv Implementations:: Some Details about other @code{iconv}
|
|
Implementations.
|
|
* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C
|
|
library.
|
|
@end menu
|
|
|
|
@node Generic Conversion Interface
|
|
@subsection Generic Character Set Conversion Interface
|
|
|
|
This set of functions follows the traditional cycle of using a resource:
|
|
open--use--close. The interface consists of three functions, each of
which implements one step.

Before the interfaces are described it is necessary to introduce a
data type. Just like other open--use--close interfaces the functions
introduced here work using handles, and the @file{iconv.h} header
defines a special type for the handles used.
|
|
|
|
@comment iconv.h
|
|
@comment XPG2
|
|
@deftp {Data Type} iconv_t
|
|
This data type is an abstract type defined in @file{iconv.h}. The user
must not assume anything about the definition of this type; it must be
treated as completely opaque.

Objects of this type can be assigned handles for the conversions using
the @code{iconv} functions. The objects themselves need not be freed but
the conversions the handles stand for have to be.
|
|
@end deftp
|
|
|
|
@noindent
|
|
The first step is the function to create a handle.
|
|
|
|
@comment iconv.h
|
|
@comment XPG2
|
|
@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
|
|
The @code{iconv_open} function has to be used before starting a
|
|
conversion. The two parameters this function takes determine the
|
|
source and destination character set for the conversion and if the
|
|
implementation has the possibility to perform such a conversion the
|
|
function returns a handle.
|
|
|
|
If the wanted conversion is not available the function returns
|
|
@code{(iconv_t) -1}. In this case the global variable @code{errno} can
|
|
have the following values:
|
|
|
|
@table @code
|
|
@item EMFILE
|
|
The process already has @code{OPEN_MAX} file descriptors open.
|
|
@item ENFILE
|
|
The system limit of open files is reached.
|
|
@item ENOMEM
|
|
Not enough memory to carry out the operation.
|
|
@item EINVAL
|
|
The conversion from @var{fromcode} to @var{tocode} is not supported.
|
|
@end table
|
|
|
|
It is not possible to use the same descriptor in different threads to
|
|
perform independent conversions. Within the data structures associated
|
|
with the descriptor there is information about the conversion state.
|
|
This must not be messed up by using it in different conversions.
|
|
|
|
An @code{iconv} descriptor is like a file descriptor in that a new
descriptor must be created for every use. The descriptor does not stand
for all of the conversions from @var{fromcode} to @var{tocode}.
|
|
|
|
The GNU C library implementation of @code{iconv_open} has one
|
|
significant extension to other implementations. To ease the extension
|
|
of the set of available conversions the implementation allows storing
|
|
the necessary files with data and code in arbitrarily many directories.
|
|
How this extension has to be written will be explained below
|
|
(@pxref{glibc iconv Implementation}). Here it is only important to say
|
|
that all directories mentioned in the @code{GCONV_PATH} environment
|
|
variable are considered if they contain a file @file{gconv-modules}.
|
|
These directories need not necessarily be created by the system
|
|
administrator. In fact, this extension is introduced to help users
write and use their own, new conversions. Of course this does not work
|
|
for security reasons in SUID binaries; in this case only the system
|
|
directory is considered and this normally is
|
|
@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment
|
|
variable is examined exactly once at the first call of the
|
|
@code{iconv_open} function. Later modifications of the variable have no
|
|
effect.
|
|
|
|
@pindex iconv.h
|
|
This function was introduced early in the X/Open Portability Guide,
@w{version 2}. It is supported by all commercial Unices as it is
required for the Unix branding. However, the quality and completeness
of the implementations vary widely. The function is declared in
|
|
@file{iconv.h}.
|
|
@end deftypefun
|
|
|
|
The @code{iconv} implementation can associate large data structures with
|
|
the handle returned by @code{iconv_open}. Therefore it is crucial to
|
|
free all the resources once all conversions are carried out and the
|
|
conversion is not needed anymore.
|
|
|
|
@comment iconv.h
|
|
@comment XPG2
|
|
@deftypefun int iconv_close (iconv_t @var{cd})
|
|
The @code{iconv_close} function frees all resources associated with the
|
|
handle @var{cd} which must have been returned by a successful call to
|
|
the @code{iconv_open} function.
|
|
|
|
If the function call was successful the return value is @math{0}.
|
|
Otherwise it is @math{-1} and @code{errno} is set appropriately.
|
|
Defined errors are:
|
|
|
|
@table @code
|
|
@item EBADF
|
|
The conversion descriptor is invalid.
|
|
@end table
|
|
|
|
@pindex iconv.h
|
|
This function was introduced together with the rest of the @code{iconv}
|
|
functions in XPG2 and it is declared in @file{iconv.h}.
|
|
@end deftypefun
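The open and close steps can be sketched together as follows. The
helper name @code{conversion_available} is hypothetical; it only probes
whether the implementation offers a given conversion and releases the
handle again right away, as required for every successful
@code{iconv_open} call.

```c
#include <iconv.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to
   TOCODE can be opened, and 0 otherwise.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return 0;

  /* Every successfully opened handle must be closed again.  */
  iconv_close (cd);
  return 1;
}
```

Note the argument order: as in @code{iconv_open} itself, the
destination character set comes first.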
|
|
|
|
The standard defines only one actual conversion function. It
therefore has the most general interface: it allows conversion from one
|
|
buffer to another. Conversion from a file to a buffer, vice versa, or
|
|
even file to file can be implemented on top of it.
|
|
|
|
@comment iconv.h
|
|
@comment XPG2
|
|
@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
|
|
@cindex stateful
|
|
The @code{iconv} function converts the text in the input buffer
|
|
according to the rules associated with the descriptor @var{cd} and
|
|
stores the result in the output buffer. It is possible to call the
|
|
function for the same text several times in a row since for stateful
|
|
character sets the necessary state information is kept in the data
|
|
structures associated with the descriptor.
|
|
|
|
The input buffer is specified by @code{*@var{inbuf}} and it contains
|
|
@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for
|
|
communicating the used input back to the caller (see below). It is
|
|
important to note that the buffer pointer is of type @code{char *} and the
|
|
length is measured in bytes even if the input text is encoded in wide
|
|
characters.
|
|
|
|
The output buffer is specified in a similar way. @code{*@var{outbuf}}
points to the beginning of the buffer with at least
@code{*@var{outbytesleft}} bytes of room for the result. The buffer
pointer again is of type @code{char *} and the length is measured in
bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer the
conversion is performed but no output is available.
|
|
|
|
If @var{inbuf} is a null pointer the @code{iconv} function performs the
|
|
necessary action to put the state of the conversion into the initial
|
|
state. This is obviously a no-op for non-stateful encodings, but if the
|
|
encoding has a state such a function call might put some byte sequences
|
|
in the output buffer which perform the necessary state changes. The
|
|
next call with @var{inbuf} not being a null pointer then simply goes on
|
|
from the initial state. It is important that the programmer never makes
|
|
any assumption on whether the conversion has to deal with states or not.
|
|
Even if the input and output character sets are not stateful the
|
|
implementation might still have to keep states. This is due to the
|
|
implementation chosen for the GNU C library as it is described below.
|
|
Therefore an @code{iconv} call to reset the state should always be
|
|
performed if some protocol requires this for the output text.
|
|
|
|
The conversion stops for one of three reasons. The first is that all
characters from the input buffer are converted. This actually can mean
two things: either all bytes from the input buffer are consumed, or
there are some bytes at the end of the buffer which could form a
complete character but the input is incomplete. The second reason for a
stop is that the output buffer is full. And the third reason is that
the input contains invalid characters.
|
|
|
|
In all these cases the buffer pointers after the last successful
conversion, for input and output buffer, are stored in @code{*@var{inbuf}} and
@code{*@var{outbuf}} and the available room in each buffer is stored in
@code{*@var{inbytesleft}} and @code{*@var{outbytesleft}}.
|
|
|
|
Since the character sets selected in the @code{iconv_open} call can be
|
|
almost arbitrary there can be situations where the input buffer contains
|
|
valid characters which have no identical representation in the output
|
|
character set. The behavior in this situation is undefined. The
|
|
@emph{current} behavior of the GNU C library in this situation is to
|
|
return with an error immediately. This certainly is not the most
|
|
desirable solution. Therefore future versions will provide better ones
|
|
but they are not yet finished.
|
|
|
|
If all input from the input buffer is successfully converted and stored
|
|
in the output buffer the function returns the number of non-reversible
|
|
conversions performed. In all other cases the return value is
|
|
@code{(size_t) -1} and @code{errno} is set appropriately. In this case
|
|
the value pointed to by @var{inbytesleft} is nonzero.
|
|
|
|
@table @code
|
|
@item EILSEQ
|
|
The conversion stopped because of an invalid byte sequence in the input.
|
|
After the call @code{*@var{inbuf}} points at the first byte of the
|
|
invalid byte sequence.
|
|
|
|
@item E2BIG
|
|
The conversion stopped because it ran out of space in the output buffer.
|
|
|
|
@item EINVAL
|
|
The conversion stopped because of an incomplete byte sequence at the end
|
|
of the input buffer.
|
|
|
|
@item EBADF
|
|
The @var{cd} argument is invalid.
|
|
@end table
|
|
|
|
@pindex iconv.h
|
|
This function was introduced in the XPG2 standard and is declared in the
|
|
@file{iconv.h} header.
|
|
@end deftypefun
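The calling convention described above can be sketched for the simplest
case, a complete text that is already in memory. The helper name
@code{convert_buffer} and its parameters are hypothetical, and the
sketch assumes that the whole input and the whole output fit into the
buffers passed in; any failure, including a full output buffer, is
reported as @code{(size_t) -1}.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert the INLEN bytes at IN from FROMCODE to
   TOCODE, storing at most OUTSIZE bytes at OUT.  Returns the number of
   bytes stored, or (size_t) -1 on any error.  */
size_t
convert_buffer (const char *tocode, const char *fromcode,
                const char *in, size_t inlen,
                char *out, size_t outsize)
{
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* iconv takes a non-const pointer but only reads the input.  */
  char *inptr = (char *) in;
  char *outptr = out;
  size_t outleft = outsize;
  size_t result = (size_t) -1;

  /* First convert the text, then call iconv with a null input buffer
     so that any byte sequence needed to return to the initial state
     is written as well.  */
  if (iconv (cd, &inptr, &inlen, &outptr, &outleft) != (size_t) -1
      && iconv (cd, NULL, NULL, &outptr, &outleft) != (size_t) -1)
    result = outsize - outleft;

  iconv_close (cd);
  return result;
}
```

The number of bytes produced is computed from the updated
@code{outleft} value, which is why @code{iconv} needs the extra level of
indirection for its buffer arguments.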
|
|
|
|
The definition of the @code{iconv} function is quite good overall. It
provides quite flexible functionality. The only problems lie in the
boundary cases, which are incomplete byte sequences at the end of the
input buffer and invalid input. A third problem, which is not really
a design problem, is the way conversions are selected. The standard
does not say anything about the legitimate names or about a minimal set of
available conversions. The discussion of other implementations below
shows how this negatively impacts portability.
|
|
|
|
|
|
@node iconv Examples
|
|
@subsection A complete @code{iconv} example
|
|
|
|
The example below features a solution for a common problem. Given that
one knows the internal encoding used by the system for @code{wchar_t}
strings, one often needs to read text from a file and store
it in wide character buffers. One can do this using @code{mbsrtowcs}
but then one runs into the problems discussed above.
|
|
|
|
@smallexample
int
file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
@{
  char inbuf[BUFSIZ];
  size_t insize = 0;
  char *wrptr = (char *) outbuf;
  int result = 0;
  iconv_t cd;

  cd = iconv_open ("UCS-4", charset);
  if (cd == (iconv_t) -1)
    @{
      /* @r{Something went wrong.} */
      if (errno == EINVAL)
        error (0, 0, "conversion from '%s' to 'UCS-4' not available",
               charset);
      else
        perror ("iconv_open");

      /* @r{Terminate the output string.} */
      *outbuf = L'\0';

      return -1;
    @}

  while (avail > 0)
    @{
      ssize_t nread;
      size_t nconv;
      char *inptr = inbuf;

      /* @r{Read more input.} */
      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
      if (nread == -1)
        @{
          /* @r{The read itself failed.} */
          result = -1;
          break;
        @}
      if (nread == 0)
        @{
          /* @r{When we come here the file is completely read.}
             @r{This still could mean there are some unused}
             @r{characters in the @code{inbuf}. Put them back.} */
          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
            result = -1;

          /* @r{Now write out the byte sequence to get into the}
             @r{initial state if this is necessary.} */
          iconv (cd, NULL, NULL, &wrptr, &avail);

          break;
        @}
      insize += nread;

      /* @r{Do the conversion.} */
      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
      if (nconv == (size_t) -1)
        @{
          /* @r{Not everything went right. It might only be}
             @r{an unfinished byte sequence at the end of the}
             @r{buffer. Or it is a real problem.} */
          if (errno == EINVAL)
            /* @r{This is harmless. Simply move the unused}
               @r{bytes to the beginning of the buffer so that}
               @r{they can be used in the next round.} */
            memmove (inbuf, inptr, insize);
          else
            @{
              /* @r{It is a real problem. Maybe we ran out of}
                 @r{space in the output buffer or we have invalid}
                 @r{input. In any case back the file pointer to}
                 @r{the position of the last processed byte.} */
              lseek (fd, -(off_t) insize, SEEK_CUR);
              result = -1;
              break;
            @}
        @}
    @}

  /* @r{Terminate the output string.} */
  if (avail >= sizeof (wchar_t))
    *((wchar_t *) wrptr) = L'\0';

  if (iconv_close (cd) != 0)
    perror ("iconv_close");

  return (wchar_t *) wrptr - outbuf;
@}
@end smallexample
|
|
|
|
@cindex stateful
|
|
This example shows the most important aspects of using the @code{iconv}
|
|
functions. It shows how successive calls to @code{iconv} can be used to
|
|
convert large amounts of text. The user does not have to care about
|
|
stateful encodings as the functions take care of everything.
|
|
|
|
An interesting point is the case where @code{iconv} returns an error and
@code{errno} is set to @code{EINVAL}. This is not really an error in
the transformation. It can happen whenever the input character set
contains byte sequences of more than one byte for some character and
texts are not processed in one piece. In this case there is a chance
that a multibyte sequence is cut. The caller can then simply read the
remainder of the text and feed the leftover bytes together with new
characters from the input to @code{iconv} and continue the work. The
internal state kept in the descriptor is @emph{not} unspecified after
such an event, as is the case with the conversion functions from the
@w{ISO C} standard.
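The handling of a cut multibyte sequence can be isolated in a small
helper. The following sketch is not part of the example above; the
names @code{struct carry} and @code{convert_chunk} and the two size
limits are hypothetical. It keeps the incomplete bytes reported through
@code{EINVAL} and puts them in front of the next chunk of input.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

#define CARRY_MAX 16    /* Assumed bound for one incomplete sequence.  */
#define CHUNK_MAX 256   /* Assumed bound for one chunk of input.  */

/* Bytes of an incomplete sequence saved between two calls.  */
struct carry
{
  char bytes[CARRY_MAX];
  size_t len;
};

/* Hypothetical helper: convert CHUNK, preceded by the bytes saved in
   CARRY, using the open descriptor CD.  Returns 0 on success, also if
   an incomplete sequence had to be saved, and -1 on a real error.  */
int
convert_chunk (iconv_t cd, struct carry *carry,
               const char *chunk, size_t chunklen,
               char **outptr, size_t *outleft)
{
  char work[CARRY_MAX + CHUNK_MAX];
  char *inptr = work;
  size_t inleft;

  if (chunklen > CHUNK_MAX)
    return -1;

  /* Put the bytes left over from the previous call first.  */
  memcpy (work, carry->bytes, carry->len);
  memcpy (work + carry->len, chunk, chunklen);
  inleft = carry->len + chunklen;
  carry->len = 0;

  if (iconv (cd, &inptr, &inleft, outptr, outleft) == (size_t) -1)
    {
      /* Only an incomplete sequence at the end is harmless.  */
      if (errno != EINVAL || inleft > CARRY_MAX)
        return -1;
      memcpy (carry->bytes, inptr, inleft);
      carry->len = inleft;
    }
  return 0;
}
```

A caller initializes the @code{len} member to zero before the first
chunk; if it is still nonzero after the last chunk, the input ended in
the middle of a character.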
|
|
|
|
The example also shows the problem of using wide character strings with
|
|
@code{iconv}. As explained in the description of the @code{iconv}
|
|
function above the function always takes a pointer to a @code{char}
|
|
array and the available space is measured in bytes. In the example the
|
|
output buffer is a wide character buffer. Therefore we use a local
|
|
variable @var{wrptr} of type @code{char *} which is used in the
|
|
@code{iconv} calls.
|
|
|
|
This looks rather innocent but can lead to problems on platforms which
have tight restrictions on alignment. Therefore the caller of
|
|
@code{iconv} has to make sure that the pointers passed are suitable for
|
|
access of characters from the appropriate character set. Since in the
|
|
above case the input parameter to the function is a @code{wchar_t}
|
|
pointer this is the case (unless the user violates alignment when
|
|
computing the parameter). But in other situations, especially when
|
|
writing generic functions where one does not know what type of character
|
|
set one uses and therefore treats text as a sequence of bytes, it might
|
|
become tricky.
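One way to stay on the safe side is to let the caller pass a real
@code{wchar_t} array and to derive the @code{char *} pointer from it, as
in this sketch. The helper name @code{bytes_to_wcs} is hypothetical,
and the sketch assumes that the implementation accepts the charset name
@code{"WCHAR_T"} for the native wide character encoding, which is not
guaranteed on every system.

```c
#include <iconv.h>
#include <wchar.h>

/* Hypothetical helper: convert the INLEN bytes at IN from FROMCODE
   into the wide character array WBUF with room for NWCHARS elements.
   Returns the number of wide characters stored, not counting the
   terminating null wide character, or (size_t) -1 on error.  */
size_t
bytes_to_wcs (const char *fromcode, const char *in, size_t inlen,
              wchar_t *wbuf, size_t nwchars)
{
  if (nwchars == 0)
    return (size_t) -1;

  /* Assumption: "WCHAR_T" names the encoding used for wchar_t.  */
  iconv_t cd = iconv_open ("WCHAR_T", fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;
  /* WBUF is a wchar_t array, so this pointer is suitably aligned.  */
  char *wrptr = (char *) wbuf;
  /* The room is measured in bytes; keep one element for L'\0'.  */
  size_t avail = (nwchars - 1) * sizeof (wchar_t);

  size_t res = iconv (cd, &inptr, &inlen, &wrptr, &avail);
  iconv_close (cd);
  if (res == (size_t) -1)
    return (size_t) -1;

  size_t n = (wchar_t *) wrptr - wbuf;
  wbuf[n] = L'\0';
  return n;
}
```

The conversion between element counts and byte counts happens in
exactly two places, which keeps the two units from being mixed up.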
|
|
|
|
|
|
@node Other iconv Implementations
|
|
@subsection Some Details about other @code{iconv} Implementations
|
|
|
|
This is not really the place to discuss the @code{iconv} implementation
|
|
of other systems but it is necessary to know a bit about them to write
|
|
portable programs. The above mentioned problems with the specification
|
|
of the @code{iconv} functions can lead to portability issues.
|
|
|
|
The first thing to notice is that due to the large number of character
|
|
sets in use it is certainly not practical to encode the conversions
|
|
directly in the C library. Therefore the conversion information must
|
|
come from files outside the C library. This is usually done in one or
|
|
both of the following ways:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The C library contains a set of generic conversion functions which can
|
|
read the needed conversion tables and other information from data files.
|
|
These files get loaded when necessary.
|
|
|
|
This solution is problematic as it requires a great deal of effort to
|
|
apply to all character sets (potentially an infinite set). The
|
|
differences in the structure of the different character sets are so large
that many different variants of the table processing functions must be
developed. On top of this the generic nature of these functions makes
them slower than specifically implemented functions.
|
|
|
|
@item
|
|
The C library only contains a framework which can dynamically load
|
|
object files and execute the therein contained conversion functions.
|
|
|
|
This solution provides much more flexibility. The C library itself
|
|
contains only very little code and therefore reduces the general memory
|
|
footprint. Also, with a documented interface between the C library and
|
|
the loadable modules it is possible for third parties to extend the set
|
|
of available conversion modules. A drawback of this solution is that
|
|
dynamic loading must be available.
|
|
@end itemize
|
|
|
|
Some implementations in commercial Unices implement a mixture of
these possibilities, the majority only the second solution. Using
loadable modules moves the code out of the library itself and keeps the
door open for extensions and improvements. But this design is also
limiting on some platforms since not many platforms support dynamic
loading in statically linked programs. On platforms without this
capability it is therefore not possible to use this interface in
statically linked programs. The GNU C library has no problems with
dynamic loading in these situations on ELF platforms and therefore this
point is moot. The danger is that one gets used to this and
forgets about the restrictions on other systems.
|
|
|
|
A second thing to know about other @code{iconv} implementations is that
|
|
the number of available conversions is often very limited. Some
|
|
implementations provide in the standard release (not special
|
|
international or developer releases) at most 100 to 200 conversion
|
|
possibilities. This does not mean 200 different character sets are
supported. E.g., conversions from one character set to a set of, say,
10 others count as 10 conversions. Together with the other direction
this already makes 20. One can imagine the thin coverage these platforms
provide. Some Unix vendors even provide only a handful of conversions
which renders them useless for almost all uses.
|
|
|
|
This directly leads to a third and probably the most problematic point.
With the way the @code{iconv} conversion functions are implemented on all
known Unix systems, the availability of the conversion functions from
character set @math{@cal{A}} to @math{@cal{B}} and the conversion from
@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
conversion from @math{@cal{A}} to @math{@cal{C}} is available.
|
|
|
|
This might not seem unreasonable or problematic at first but it is
quite a big problem, as one will notice shortly after hitting it. To show
the problem, assume we write a program which has to convert from
@math{@cal{A}} to @math{@cal{C}}. A call like
|
|
|
|
@smallexample
|
|
cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
|
|
@end smallexample
|
|
|
|
@noindent
|
|
fails according to the assumption above. But what does the program
do now? The conversion is really necessary and therefore simply giving
up is not an option.

This is a nuisance. The @code{iconv} function should take care of this.
But how should the program proceed from here on? If it tries to
convert to character set @math{@cal{B}} first, the two @code{iconv_open}
calls
|
|
|
|
@smallexample
|
|
cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
|
|
@end smallexample
|
|
|
|
@noindent
|
|
and
|
|
|
|
@smallexample
|
|
cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
|
|
@end smallexample
|
|
|
|
@noindent
|
|
will succeed but how to find @math{@cal{B}}?
|
|
|
|
Unfortunately, the answer is: there is no general solution. On some
|
|
systems guessing might help. On those systems most character sets can
|
|
convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.
|
|
Beside this only some very system-specific methods can help. Since the
|
|
conversion functions come from loadable modules and these modules must
|
|
be stored somewhere in the filesystem, one @emph{could} try to find them
and determine from the available files which conversions are available
|
|
and whether there is an indirect route from @math{@cal{A}} to
|
|
@math{@cal{C}}.
|
|
|
|
This shows one of the design errors of @code{iconv} mentioned above. It
|
|
should at least be possible to determine the list of available
conversions programmatically so that if @code{iconv_open} says there is
|
|
no such conversion, one could make sure this also is true for indirect
|
|
routes.
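The guessing strategy mentioned above can be sketched as follows: try
the direct conversion first and fall back to a route through an
intermediate character set chosen by the caller, such as UTF-8. The
names @code{run_conversion} and @code{convert_with_pivot} and the fixed
size of the temporary buffer are assumptions made for this sketch; it
handles only complete texts that fit into the buffers.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert one complete buffer with an open descriptor.  Returns the
   number of bytes stored at OUT, or (size_t) -1 on error.  */
static size_t
run_conversion (iconv_t cd, const char *in, size_t inlen,
                char *out, size_t outsize)
{
  char *inptr = (char *) in;   /* iconv only reads the input.  */
  size_t outleft = outsize;

  if (iconv (cd, &inptr, &inlen, &out, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &out, &outleft) == (size_t) -1)
    return (size_t) -1;
  return outsize - outleft;
}

/* Hypothetical helper: convert from FROMCODE to TOCODE, going through
   PIVOT if the direct conversion cannot be opened.  */
size_t
convert_with_pivot (const char *tocode, const char *fromcode,
                    const char *pivot, const char *in, size_t inlen,
                    char *out, size_t outsize)
{
  size_t n = (size_t) -1;

  /* Try the direct route first.  */
  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd != (iconv_t) -1)
    {
      n = run_conversion (cd, in, inlen, out, outsize);
      iconv_close (cd);
      return n;
    }

  /* Fall back to two steps through the intermediate character set.  */
  iconv_t cd1 = iconv_open (pivot, fromcode);
  iconv_t cd2 = iconv_open (tocode, pivot);
  if (cd1 != (iconv_t) -1 && cd2 != (iconv_t) -1)
    {
      char tmp[1024];   /* Assumed to be large enough for the text.  */
      size_t mid = run_conversion (cd1, in, inlen, tmp, sizeof tmp);
      if (mid != (size_t) -1)
        n = run_conversion (cd2, tmp, mid, out, outsize);
    }
  if (cd1 != (iconv_t) -1)
    iconv_close (cd1);
  if (cd2 != (iconv_t) -1)
    iconv_close (cd2);
  return n;
}
```

A failure of this helper still does not prove that no route exists; it
only shows that neither the direct conversion nor the one guessed
intermediate character set worked.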
|
|
|
|
|
|
@node glibc iconv Implementation
|
|
@subsection The @code{iconv} Implementation in the GNU C library
|
|
|
|
After reading about the problems of @code{iconv} implementations in the
|
|
last section it is certainly good to note that the implementation in
|
|
the GNU C library has none of the problems mentioned above. What
|
|
follows is a step-by-step analysis of the points raised above. The
|
|
evaluation is based on the current state of the development (as of
|
|
January 1999). The development of the @code{iconv} functions is not
|
|
complete, but basic functionality has solidified.
|
|
|
|
The GNU C library's @code{iconv} implementation uses shared loadable
|
|
modules to implement the conversions. A very small number of
|
|
conversions are built into the library itself but these are only rather
|
|
trivial conversions.
|
|
|
|
All the benefits of loadable modules are available in the GNU C library
|
|
implementation. This is especially appealing since the interface is
|
|
well documented (see below) and it therefore is easy to write new
|
|
conversion modules. The drawback of using loadable objects is not a
|
|
problem in the GNU C library, at least on ELF systems. Since the
|
|
library is able to load shared objects even in statically linked
|
|
binaries this means that static linking need not be forbidden in
|
|
case one wants to use @code{iconv}.
|
|
|
|
The second mentioned problem is the number of supported conversions.
|
|
Currently, the GNU C library supports more than 150 character sets. The
|
|
way the implementation is designed the number of supported conversions
|
|
is greater than 22350 (@math{150} times @math{149}). If any conversion
|
|
from or to a character set is missing it can easily be added.
|
|
|
|
Particularly impressive as it may be, this high number is due to the
|
|
fact that the GNU C library implementation of @code{iconv} does not have
|
|
the third problem mentioned above. I.e., whenever there is a conversion
|
|
from a character set @math{@cal{A}} to @math{@cal{B}} and from
|
|
@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
|
|
@math{@cal{A}} to @math{@cal{C}} directly. If @code{iconv_open}
|
|
returns an error and sets @code{errno} to @code{EINVAL} this really
|
|
means there is no known way, directly or indirectly, to perform the
|
|
wanted conversion.
|
|
|
|
@cindex triangulation
|
|
This is achieved by providing for each character set a conversion from
|
|
and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
intermediate representation it is possible to @dfn{triangulate}, i.e.,
to convert with an intermediate representation.
|
|
|
|
There is no inherent requirement to provide a conversion to @w{ISO
|
|
10646} for a new character set and it is also possible to provide other
|
|
conversions where neither source nor destination character set is @w{ISO
|
|
10646}. The currently existing set of conversions is simply meant to
|
|
cover all conversions which might be of interest.
|
|
|
|
@cindex ISO-2022-JP
@cindex EUC-JP
All currently available conversions use the triangulation method above,
which makes some conversions run unnecessarily slowly. If, e.g., somebody
often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
would involve direct conversion between the two character sets, skipping
the intermediate conversion to @w{ISO 10646}. The two character sets of
interest are much more similar to each other than to @w{ISO 10646}.

In such a situation one can easily write a new conversion and provide it
as a better alternative. The GNU C library @code{iconv} implementation
will automatically use the module implementing the conversion if it is
specified to be more efficient.
|
|
|
|
@subsubsection Format of @file{gconv-modules} files
|
|
|
|
All information about the available conversions comes from a file named
|
|
@file{gconv-modules} which can be found in any of the directories along
|
|
the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented
|
|
text files, where each of the lines has one of the following formats:
|
|
|
|
@itemize @bullet
@item
If the first non-whitespace character is a @kbd{#} the line contains
only comments and is ignored.

@item
Lines starting with @code{alias} define an alias name for a character
set. Two more words are expected on the line. The first one
defines the alias name and the second defines the original name of the
character set. The effect is that it is possible to use the alias name
in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and
achieve the same result as when using the real character set name.

This is quite important as a character set often has many different
names. There is normally an official name but this need not
correspond to the most popular name. Besides this, many character sets
have special names which are constructed by some rule. E.g., all character
sets specified by the ISO have an alias of the form
@code{ISO-IR-@var{nnn}} where @var{nnn} is the registration number.
This allows programs which know about the registration number to
construct character set names and use them in @code{iconv_open} calls.
More on the available names and aliases follows below.

@item
Lines starting with @code{module} introduce an available conversion
module. These lines must contain three or four more words.

The first word specifies the source character set, the second word the
destination character set of the conversion implemented in this module. The
third word is the name of the loadable module. The filename is
constructed by appending the usual shared object suffix (normally
@file{.so}) and this file is then supposed to be found in the same
directory the @file{gconv-modules} file is in. The last word on the
line, which is optional, is a numeric value representing the cost of the
conversion. If this word is missing a cost of @math{1} is assumed. The
numeric value itself does not matter that much; what counts are the
relative values of the sums of costs for all possible conversion paths.
Below is a more precise description of the use of the cost value.
@end itemize
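
The three line formats can be illustrated with a small parser. The
following is a minimal sketch and not the code the library uses. The
names @code{parse_gconv_line}, @code{struct parsed_line}, and the
@code{LINE_} constants are invented here, the field sizes are arbitrary,
and treating any other first word as invalid is an assumption of this
sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char first[64];    /* Alias name, or source character set.  */
  char second[64];   /* Original name, or destination character set.  */
  char module[64];   /* Module name without suffix (module lines only).  */
  int cost;          /* Conversion cost (module lines only).  */
};

/* Classify one line of a gconv-modules file and split it into fields.  */
struct parsed_line
parse_gconv_line (const char *line)
{
  struct parsed_line p;
  char keyword[16];
  int consumed = 0;

  assert (line != NULL);
  memset (&p, 0, sizeof p);
  p.kind = LINE_IGNORED;

  /* Skip leading white space; empty lines and comments are ignored.  */
  line += strspn (line, " \t");
  if (*line == '\0' || *line == '\n' || *line == '#')
    return p;

  if (sscanf (line, "%15s%n", keyword, &consumed) != 1)
    {
      p.kind = LINE_INVALID;
      return p;
    }
  line += consumed;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias and the original name.  */
      int n = sscanf (line, "%63s %63s", p.first, p.second);
      p.kind = n == 2 ? LINE_ALIAS : LINE_INVALID;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words; the fourth is the optional cost.  */
      int n = sscanf (line, "%63s %63s %63s %d",
                      p.first, p.second, p.module, &p.cost);
      if (n == 3)
        p.cost = 1;   /* A missing cost word means a cost of 1.  */
      p.kind = n >= 3 ? LINE_MODULE : LINE_INVALID;
    }
  else
    p.kind = LINE_INVALID;

  return p;
}
```

Passing the line @code{module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP}
yields a @code{LINE_MODULE} record with the default cost of 1.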
|
|
|
|
Returning to the example above where one has written a module to directly
convert from ISO-2022-JP to EUC-JP and back: all that has to be done is
to put the new module, say with the name @file{ISO2022JP-EUCJP.so}, in a
directory and add a file @file{gconv-modules} with the following content
in the same directory:
|
|
|
|
@smallexample
module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1
@end smallexample
|
|
|
|
To see why this is sufficient, it is necessary to understand how the
|
|
conversion used by @code{iconv} (and described in the descriptor) is
|
|
selected. The approach to this problem is quite simple.
|
|
|
|
At the first call of the @code{iconv_open} function the program reads
|
|
all available @file{gconv-modules} files and builds up two tables: one
|
|
containing all the known aliases and another which contains the
|
|
information about the conversions and which shared object implements
|
|
them.
|
|
|
|
@subsubsection Finding the conversion path in @code{iconv}
|
|
|
|
The set of available conversions forms a directed graph with weighted
edges. The weights on the edges are the costs specified in the
@file{gconv-modules} files. The @code{iconv_open} function uses an
algorithm suitable for searching for the best path in such a graph and so
constructs a list of conversions which must be performed in succession
to get the transformation from the source to the destination character
set.

Explaining why the above @file{gconv-modules} file allows the
@code{iconv} implementation to select the specific ISO-2022-JP to
EUC-JP conversion module instead of the conversion coming with the
library itself is straightforward. Since the latter conversion takes two
steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
EUC-JP) the cost is @math{1+1 = 2}. But the above @file{gconv-modules}
file specifies that the new conversion module can perform this
conversion with a cost of only @math{1}.
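
The cost comparison can be reproduced with a small shortest-path
computation. This manual does not say which algorithm
@code{iconv_open} uses, so the sketch below simply picks Bellman-Ford
relaxation because it is short. The function @code{cheapest_path}, the
node numbering, and the limit of 16 nodes are all assumptions made for
the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_NODES 16

/* Character sets are the nodes of the graph.  */
enum { ISO2022JP, INTERNAL, EUCJP, NNODES };

/* One directed edge: a conversion from node FROM to node TO.  */
struct edge
{
  int from;
  int to;
  int cost;
};

/* Return the minimal total cost of a path from SRC to DST, or -1 if
   there is no path.  Costs are assumed to be positive.  */
int
cheapest_path (const struct edge *edges, size_t nedges,
               int nnodes, int src, int dst)
{
  int dist[MAX_NODES];

  assert (nnodes > 0 && nnodes <= MAX_NODES);
  assert (src >= 0 && src < nnodes && dst >= 0 && dst < nnodes);

  for (int i = 0; i < nnodes; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* A cheapest path visits every node at most once, so relaxing all
     edges NNODES - 1 times is sufficient.  */
  for (int round = 1; round < nnodes; ++round)
    for (size_t i = 0; i < nedges; ++i)
      {
        int from = edges[i].from;
        int to = edges[i].to;

        if (dist[from] != INT_MAX && dist[from] + edges[i].cost < dist[to])
          dist[to] = dist[from] + edges[i].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With only the two triangulation edges the result for ISO-2022-JP to
EUC-JP is 2; once an edge from @code{ISO2022JP} to @code{EUCJP} with cost
1 is added the result drops to 1, which is why the direct module wins.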
|
|
|
|
A mysterious piece about the @file{gconv-modules} file above (and also
|
|
the file coming with the GNU C library) are the names of the character
|
|
sets specified in the @code{module} lines. Why do almost all the names
|
|
end in @code{//}? And this is not all: the names can actually be
|
|
regular expressions. At this point of time this mystery should not be
|
|
revealed, unless you have the relevant spell-casting materials: ashes
|
|
from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
|
|
blessed by St.@: Emacs, assorted herbal roots from Central America, sand
|
|
from Cebu, etc. Sorry! @strong{The part of the implementation where
|
|
this is used is not yet finished. For now please simply follow the
|
|
existing examples. It'll become clearer once it is. --drepper}
|
|
|
|
A last remark about the @file{gconv-modules} file concerns the names not
ending with @code{//}. There often is a character set named
@code{INTERNAL} mentioned. From the discussion above and the chosen
name it should have become clear that this is the name for the
representation used in the intermediate step of the triangulation. We
have said that this is UCS-4 but actually that is not quite right. The
UCS-4 specification also includes the specification of the byte ordering
used. Since a UCS-4 value consists of four bytes, a stored value is
affected by byte ordering. The internal representation is @emph{not}
the same as UCS-4 in case the byte ordering of the processor (or at least
the running process) is not the same as the one required for UCS-4. This
is done for performance reasons as one does not want to perform
unnecessary byte-swapping operations if one is not interested in actually
seeing the result in UCS-4. To avoid trouble with endianness the internal
representation consistently is named @code{INTERNAL} even on big-endian
systems where the representations are identical.
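
The difference between the two representations can be made visible
with a few lines of code. This is only a sketch; the function names are
invented for the example and nothing here is part of the library
interface. It stores the same character value once in host byte order,
as described above for @code{INTERNAL}, and once most significant byte
first, as required for UCS-4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store CH the way the processor keeps a 32-bit value in memory.  */
void
store_host_order (uint32_t ch, unsigned char out[4])
{
  assert (out != NULL);
  memcpy (out, &ch, 4);
}

/* Store CH most significant byte first (big-endian).  */
void
store_big_endian (uint32_t ch, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (ch >> 24);
  out[1] = (unsigned char) (ch >> 16);
  out[2] = (unsigned char) (ch >> 8);
  out[3] = (unsigned char) ch;
}

/* Nonzero if both representations of CH are byte-for-byte identical.
   This holds for every CH only on a big-endian host.  */
int
representations_match (uint32_t ch)
{
  unsigned char host[4];
  unsigned char be[4];

  store_host_order (ch, host);
  store_big_endian (ch, be);
  return memcmp (host, be, 4) == 0;
}
```

On a little-endian machine @code{representations_match (0x01020304)}
returns zero, which is exactly the case in which a byte swap would be
needed to produce real UCS-4.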
|
|
|
|
@subsubsection @code{iconv} module data structures
|
|
|
|
So far this section has described how modules are located and considered
for use. What remains to be described is the interface of the modules
so that one can write new ones. This section describes the interface as
it is in use in January 1999. The interface will change a bit in the
future but hopefully only in an upward compatible way.
|
|
|
|
The definitions necessary to write new modules are publicly available
|
|
in the non-standard header @file{gconv.h}. The following text will
|
|
therefore describe the definitions from this header file. But first it
|
|
is necessary to get an overview.
|
|
|
|
From the perspective of the user of @code{iconv} the interface is quite
simple: the @code{iconv_open} function returns a handle which can be
used in calls to @code{iconv} and finally the handle is freed with a call
to @code{iconv_close}. The problem is that the handle has to be able to
represent the possibly long sequences of conversion steps and also the
state of each conversion since the handle is all that is passed to the
@code{iconv} function. Therefore the data structures are really the
key to understanding the implementation.
|
|
|
|
We need two different kinds of data structures. The first describes the
|
|
conversion and the second describes the state etc. There are really two
|
|
type definitions like this in @file{gconv.h}.
|
|
@pindex gconv.h
|
|
|
|
@comment gconv.h
|
|
@comment GNU
|
|
@deftp {Data type} {struct __gconv_step}
|
|
This data structure describes one conversion a module can perform. For
|
|
each function in a loaded module with conversion functions there is
|
|
exactly one object of this type. This object is shared by all users of
|
|
the conversion. I.e., this object does not contain any information
|
|
corresponding to an actual conversion. It only describes the conversion
|
|
itself.
|
|
|
|
@table @code
|
|
@item struct __gconv_loaded_object *__shlib_handle
|
|
@itemx const char *__modname
|
|
@itemx int __counter
|
|
All these elements of the structure are used internally in the C library
to coordinate loading and unloading the shared object. One must not
expect any of the other elements to be available or initialized.
|
|
|
|
@item const char *__from_name
|
|
@itemx const char *__to_name
|
|
@code{__from_name} and @code{__to_name} contain the names of the source and
|
|
destination character sets. They can be used to identify the actual
|
|
conversion to be carried out since one module might implement
|
|
conversions for more than one character set and/or direction.
|
|
|
|
@item __gconv_fct __fct
@itemx __gconv_init_fct __init_fct
@itemx __gconv_end_fct __end_fct
|
|
These elements contain pointers to the functions in the loadable module.
|
|
The interface will be explained below.
|
|
|
|
@item int __min_needed_from
@itemx int __max_needed_from
@itemx int __min_needed_to
@itemx int __max_needed_to
These values have to be filled in by the init function of the module. The
@code{__min_needed_from} value specifies how many bytes a character of
the source character set at least needs. The @code{__max_needed_from}
value specifies the maximum number, which also includes possible shift
sequences.
|
|
|
|
The @code{__min_needed_to} and @code{__max_needed_to} values serve the
|
|
same purpose but this time for the destination character set.
|
|
|
|
It is crucial that these values are accurate since otherwise the
|
|
conversion functions will have problems or not work at all.
|
|
|
|
@item int __stateful
|
|
This element must also be initialized by the init function. It is
|
|
nonzero if the source character set is stateful. Otherwise it is zero.
|
|
|
|
@item void *__data
|
|
This element can be used freely by the conversion functions in the
|
|
module. It can be used to communicate extra information from one call
|
|
to another. It need not be initialized if not needed at all. If this
|
|
element gets assigned a pointer to dynamically allocated memory
|
|
(presumably in the init function) it has to be made sure that the end
|
|
function deallocates the memory. Otherwise the application will leak
|
|
memory.
|
|
|
|
It is important to be aware that this data structure is shared by all
users of this specific conversion and therefore the @code{__data}
element must not contain data specific to one specific use of the
conversion function.
|
|
@end table
|
|
@end deftp
|
|
|
|
@comment gconv.h
|
|
@comment GNU
|
|
@deftp {Data type} {struct __gconv_step_data}
|
|
This is the data structure which contains the information specific to
|
|
each use of the conversion functions.
|
|
|
|
@table @code
|
|
@item char *__outbuf
|
|
@itemx char *__outbufend
|
|
These elements specify the output buffer for the conversion step. The
@code{__outbuf} element points to the beginning of the buffer and
@code{__outbufend} points to the byte following the last byte in the
buffer. The conversion function must not assume anything about the size
of the buffer but it can be safely assumed that there is room for at
least one complete character in the output buffer.

Once the conversion is finished and the conversion is the last step, the
@code{__outbuf} element must be modified to point after the last byte
written into the buffer to signal how much output is available. If this
conversion step is not the last one the element must not be modified.
The @code{__outbufend} element must not be modified.
|
|
|
|
@item int __is_last
|
|
This element is nonzero if this conversion step is the last one. This
|
|
information is necessary for the recursion. See the description of the
|
|
conversion function internals below. This element must never be
|
|
modified.
|
|
|
|
@item int __invocation_counter
|
|
The conversion function can use this element to see how many calls of
the conversion function already happened. Some character sets require
a certain prolog when generating output, and by comparing this value with
zero one can find out whether it is the first call and therefore whether
the prolog should be emitted. This element must never be modified.
|
|
|
|
@item int __internal_use
|
|
This element is another one rarely used but needed in certain
situations. It is assigned a nonzero value in case the conversion
functions are used to implement @code{mbsrtowcs} et al. I.e., the
function is not used directly through the @code{iconv} interface.

This sometimes makes a difference as it is expected that the
@code{iconv} functions are used to translate entire texts while the
@code{mbsrtowcs} functions are normally only used to convert single
strings and might be used multiple times to convert entire texts.

But in this situation we would have a problem complying with some rules
of the character set specification. Some character sets require a prolog
which must appear exactly once for an entire text. If a number of
@code{mbsrtowcs} calls are used to convert the text only the first call
must add the prolog. But since there is no communication between the
different calls of @code{mbsrtowcs} the conversion functions have no
possibility to find this out. The situation is different for sequences
of @code{iconv} calls since the handle allows access to the needed
information.

This element is mostly used together with @code{__invocation_counter} in
a way like this:

@smallexample
if (!data->__internal_use
    && data->__invocation_counter == 0)
  /* @r{Emit prolog.} */
  ...
@end smallexample
|
|
|
|
This element must never be modified.
|
|
|
|
@item mbstate_t *__statep
|
|
The @code{__statep} element points to an object of type @code{mbstate_t}
(@pxref{Keeping the state}). The conversion of a stateful character
set must use the object pointed to by this element to store information
about the conversion state. The @code{__statep} element itself must
never be modified.
|
|
|
|
@item mbstate_t __state
|
|
This element @emph{never} must be used directly. It is only part of
|
|
this structure to have the needed space allocated.
|
|
@end table
|
|
@end deftp
|
|
|
|
@subsubsection @code{iconv} module interfaces
|
|
|
|
With the knowledge about the data structures we now can describe the
conversion functions themselves. To understand the interface a bit of
knowledge about the functionality in the C library which loads the
objects with the conversions is necessary.

It is often the case that one conversion is used more than once. I.e.,
there are several @code{iconv_open} calls for the same set of character
sets during one program run. The @code{mbsrtowcs} et al.@: functions in
the GNU C library also use the @code{iconv} functionality, which
increases the number of uses of the same functions even more.
|
|
|
|
For this reason the modules do not get loaded exclusively for one
|
|
conversion. Instead a module once loaded can be used by arbitrarily many
|
|
@code{iconv} or @code{mbsrtowcs} calls at the same time. The splitting
|
|
of the information between conversion function specific information and
|
|
conversion data makes this possible. The last section showed the two
|
|
data structures used to do this.
|
|
|
|
This is of course also reflected in the interface and semantics of the
|
|
functions the modules must provide. There are three functions which
|
|
must have the following names:
|
|
|
|
@table @code
@item gconv_init
The @code{gconv_init} function initializes the conversion function
specific data structure. This very same object is shared by all
users of this conversion and therefore no state information
about the conversion itself must be stored in here. If a module
implements more than one conversion the @code{gconv_init} function will be
called multiple times.

@item gconv_end
The @code{gconv_end} function is responsible for freeing all resources
allocated by the @code{gconv_init} function. If there is nothing to do
this function can be missing. Special care must be taken if the module
implements more than one conversion and the @code{gconv_init} function
does not allocate the same resources for all conversions.

@item gconv
This is the actual conversion function. It is called to convert one
block of text. It gets passed the conversion step information
initialized by @code{gconv_init} and the conversion data, specific to
this use of the conversion functions.
@end table

There are three data types defined for the three module interface
functions and these define the interface.
|
|
|
|
@comment gconv.h
|
|
@comment GNU
|
|
@deftypevr {Data type} int (*__gconv_init_fct) (struct __gconv_step *)
|
|
This specifies the interface of the initialization function of the
|
|
module. It is called exactly once for each conversion the module
|
|
implements.
|
|
|
|
As explained in the description of the @code{struct __gconv_step} data
structure above, the initialization function has to initialize parts of
it.
|
|
|
|
@table @code
|
|
@item __min_needed_from
|
|
@itemx __max_needed_from
|
|
@itemx __min_needed_to
|
|
@itemx __max_needed_to
|
|
These elements must be initialized to the exact numbers of the minimum
|
|
and maximum number of bytes used by one character in the source and
|
|
destination character set respectively. If the characters all have the
|
|
same size the minimum and maximum values are the same.
|
|
|
|
@item __stateful
|
|
This element must be initialized to a nonzero value if the source
character set is stateful. Otherwise it must be zero.
|
|
@end table
|
|
|
|
If the initialization function needs to communicate some information
to the conversion function, this can happen using the @code{__data}
element of the @code{__gconv_step} structure. But since this data is
shared by all the conversions, it must not be modified by the conversion
function. How this can be used is shown in the example below.
|
|
|
|
@smallexample
#define MIN_NEEDED_FROM 1
#define MAX_NEEDED_FROM 4
#define MIN_NEEDED_TO   4
#define MAX_NEEDED_TO   4

int
gconv_init (struct __gconv_step *step)
@{
  /* @r{Determine which direction.} */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
    @{
      dir = from_iso2022jp;
      var = iso2022jp;
    @}
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
    @{
      dir = to_iso2022jp;
      var = iso2022jp;
    @}
  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
    @{
      dir = from_iso2022jp;
      var = iso2022jp2;
    @}
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
    @{
      dir = to_iso2022jp;
      var = iso2022jp2;
    @}

  result = __GCONV_NOCONV;
  if (dir != illegal_dir)
    @{
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = __GCONV_NOMEM;
      if (new_data != NULL)
        @{
          new_data->dir = dir;
          new_data->var = var;
          step->__data = new_data;

          if (dir == from_iso2022jp)
            @{
              step->__min_needed_from = MIN_NEEDED_FROM;
              step->__max_needed_from = MAX_NEEDED_FROM;
              step->__min_needed_to = MIN_NEEDED_TO;
              step->__max_needed_to = MAX_NEEDED_TO;
            @}
          else
            @{
              step->__min_needed_from = MIN_NEEDED_TO;
              step->__max_needed_from = MAX_NEEDED_TO;
              step->__min_needed_to = MIN_NEEDED_FROM;
              step->__max_needed_to = MAX_NEEDED_FROM + 2;
            @}

          /* @r{Yes, this is a stateful encoding.} */
          step->__stateful = 1;

          result = __GCONV_OK;
        @}
    @}

  return result;
@}
@end smallexample
|
|
|
|
The function first checks which conversion is wanted. The module from
which this function is taken implements four different conversions;
which one is selected can be determined by comparing the names. The
comparison should always be done without paying attention to the case.

Then a data structure is allocated which contains the necessary
information about which conversion is selected. The data structure
@code{struct iso2022jp_data} is locally defined since outside the module
this data is not used at all. Please note that if all four conversions
this module supports are requested there are four data blocks.

One interesting thing is the initialization of the @code{__min_} and
@code{__max_} elements of the step data object. A single ISO-2022-JP
character can consist of one to four bytes. Therefore the
@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
this way. The output is always the @code{INTERNAL} character set (aka
UCS-4) and therefore each character consists of exactly four bytes. For
the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the @code{__max_needed_to} element for this direction
gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
two bytes needed for the escape sequences to signal the switching. The
asymmetry in the maximum values for the two directions can be explained
easily: when reading ISO-2022-JP text, escape sequences can be handled
alone. I.e., it is not necessary to process a real character since the
effect of the escape sequence can be recorded in the state information.
The situation is different for the other direction. Since it is in
general not known which character comes next, one cannot emit escape
sequences to change the state in advance. This means the escape
sequences have to be emitted together with the next character.
Therefore one needs more room than only for the character itself.
|
|
|
|
The possible return values of the initialization function are:
|
|
|
|
@table @code
|
|
@item __GCONV_OK
|
|
The initialization succeeded.
|
|
@item __GCONV_NOCONV
|
|
The requested conversion is not supported in the module. This can
|
|
happen if the @file{gconv-modules} file has errors.
|
|
@item __GCONV_NOMEM
|
|
Memory required to store additional information could not be allocated.
|
|
@end table
|
|
@end deftypevr
|
|
|
|
The function called before the module is unloaded is significantly
easier. It often has nothing at all to do, in which case it can be left
out completely.
|
|
|
|
@comment gconv.h
|
|
@comment GNU
|
|
@deftypevr {Data type} void (*__gconv_end_fct) (struct __gconv_step *)
The task of this function is to free all resources allocated in the
initialization function. Therefore only the @code{__data} element of
the object pointed to by the argument is of interest. Continuing the
example from the initialization function, the finalization function
looks like this:

@smallexample
void
gconv_end (struct __gconv_step *data)
@{
  free (data->__data);
@}
@end smallexample
|
|
@end deftypevr
|
|
|
|
The most important function is the conversion function itself. It can
|
|
get quite complicated for complex character sets. But since this is not
|
|
of interest here we will only describe a possible skeleton for the
|
|
conversion function.
|
|
|
|
@comment gconv.h
|
|
@comment GNU
|
|
@deftypevr {Data type} int (*__gconv_fct) (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
|
|
The conversion function can be called for two basic reasons: to convert
text or to reset the state. From the description of the @code{iconv}
function it can be seen why the flushing mode is necessary. Which mode
is selected is determined by the sixth argument, an integer. If it is
nonzero, flushing is selected.

Common to both modes is where the output buffer can be found. The
information about this buffer is stored in the conversion step data. A
pointer to this is passed as the second argument to this function. The
description of the @code{struct __gconv_step_data} structure has more
information on this.

@cindex stateful
What has to be done for flushing depends on the source character set.
If it is not stateful nothing has to be done. Otherwise the function
has to emit a byte sequence to bring the state object into the initial
state. Once this has happened the other conversion modules in the chain
of conversions have to get the same chance. Whether another step
follows can be determined from the @code{__is_last} element of the step
data structure to which the second parameter points.
|
|
|
|
The more interesting mode is when text actually has to be converted.
The first step in this case is to convert as much text as possible from
the input buffer and store the result in the output buffer. The start
of the input buffer is determined by the third argument which is a
pointer to a pointer variable referencing the beginning of the buffer.
The fourth argument is a pointer to the byte right after the last byte
in the buffer.
|
|
|
|
The conversion has to be performed according to the current state if the
|
|
character set is stateful. The state is stored in an object pointed to
|
|
by the @code{__statep} element of the step data (second argument). Once
|
|
either the input buffer is empty or the output buffer is full the
|
|
conversion stops. At this point the pointer variable referenced by the
|
|
third parameter must point to the byte following the last processed
|
|
byte. I.e., if all of the input is consumed this pointer and the fourth
|
|
parameter have the same value.
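
The pointer handling just described can be shown with a toy conversion
loop. This is a sketch under simplifying assumptions and not a real
module function: the source character set uses one byte per character,
the output is a 32-bit value in host byte order, there is no state, and
the names @code{convert_block}, @code{EMPTY_INPUT}, and
@code{FULL_OUTPUT} are stand-ins invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes standing in for the __GCONV_* values.  */
enum { EMPTY_INPUT, FULL_OUTPUT };

/* Convert single-byte characters to 32-bit values in host byte order.
   The loop stops when the input is used up or the output buffer is
   full.  On return *INPTR points to the byte following the last
   processed input byte and *OUTPTR to the byte following the last
   byte written.  */
int
convert_block (const unsigned char **inptr, const unsigned char *inend,
               unsigned char **outptr, unsigned char *outend)
{
  const unsigned char *in = *inptr;
  unsigned char *out = *outptr;
  int status = EMPTY_INPUT;

  assert (in <= inend && out <= outend);

  while (in != inend)
    {
      if (outend - out < 4)
        {
          /* No room for one more complete character.  */
          status = FULL_OUTPUT;
          break;
        }

      uint32_t ch = *in++;
      memcpy (out, &ch, 4);
      out += 4;
    }

  /* Report progress back to the caller.  */
  *inptr = in;
  *outptr = out;
  return status;
}
```

If all input was consumed, @code{*inptr} equals @code{inend} after the
call, which corresponds to the third and fourth parameters having the
same value.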
|
|
|
|
What now happens depends on whether this step is the last one or not.
If it is the last step the only thing which has to be done is to update
the @code{__outbuf} element of the step data structure to point after the
last written byte. This gives the caller the information on how much
text is available in the output buffer. Besides this, the variable
pointed to by the fifth parameter, which is of type @code{size_t}, must
be incremented by the number of characters (@emph{not bytes}) which were
converted in a non-reversible way. Then the function can return.
|
|
|
|
In case the step is not the last one the later conversion functions have
|
|
to get a chance to do their work. Therefore the appropriate conversion
|
|
function has to be called. The information about the functions is
|
|
stored in the conversion data structures, passed as the first parameter.
|
|
This information and the step data are stored in arrays so the next
|
|
element in both cases can be found by simple pointer arithmetic:
|
|
|
|
@smallexample
int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
@{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  ...
@end smallexample
|
|
|
|
The @code{next_step} pointer references the next step information and
|
|
@code{next_data} the next data record. The call of the next function
|
|
therefore will look similar to this:
|
|
|
|
@smallexample
  next_step->__fct (next_step, next_data, &outerr, outbuf,
                    written, 0)
@end smallexample
|
|
|
|
But this is not yet all. Once the function call returns the conversion
function might have some more to do. If the return value of the
function is @code{__GCONV_EMPTY_INPUT} this means there is more room in
the output buffer. Unless the input buffer is empty the conversion
function starts all over again and processes the rest of the input
buffer. If the return value is not @code{__GCONV_EMPTY_INPUT} something
went wrong and we have to recover from this.

A requirement for the conversion function is that the input buffer
pointer (the third argument) always points to the last character which
was put in the converted form in the output buffer. This is trivially
true after the conversion performed in the current step. But if the
conversion functions deeper down the stream stop prematurely not all
characters from the output buffer are consumed and therefore the input
buffer pointers must be backed off to the right position.

This is easy to do if the input and output character sets have a fixed
width for all characters. In this situation we can compute how many
characters are left in the output buffer and therefore can correct the
input buffer pointer appropriately with a similar computation. Things
get tricky if either character set has characters represented with
variable length byte sequences, and it gets even more complicated if the
conversion has to take care of the state. In these cases the conversion
has to be performed once again, from the known state before the initial
conversion. I.e., if necessary the state of the conversion has to be
reset and the conversion loop has to be executed again. The difference
now is that it is known how much output must be created and the
conversion can stop before converting the first unused character. Once
this is done the input buffer pointers must be updated again and the
function can return.
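
For the easy case of two fixed-width character sets the correction
of the input pointer is plain arithmetic. The following sketch
illustrates the computation only. The function name
@code{back_off_input} and its parameters are invented here, and the
code assumes that every character occupies exactly @code{in_width}
bytes in the input and @code{out_width} bytes in the output.

```c
#include <assert.h>
#include <stddef.h>

/* Return the corrected input position.
   INPTR   input position reached by this step's conversion loop
   OUTPTR  end of the output this step produced
   OUTERR  first output byte the next step did not consume
   Every character left over between OUTERR and OUTPTR corresponds to
   IN_WIDTH bytes of input which must be given back.  */
const char *
back_off_input (const char *inptr, const char *outptr,
                const char *outerr, size_t in_width, size_t out_width)
{
  assert (out_width > 0 && outerr <= outptr);

  size_t unused = (size_t) (outptr - outerr) / out_width;
  return inptr - unused * in_width;
}
```

If, for example, five one-byte characters were converted to 20 bytes of
output and the next step consumed only the first 12 bytes, two
characters are left over and the input pointer moves back by two bytes.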
|
|
|
|
One final thing should be mentioned. If it is necessary for the
|
|
conversion to know whether it is the first invocation (in case a prolog
|
|
has to be emitted) the conversion function should just before returning
|
|
to the caller increment the @code{__invocation_counter} element of the
|
|
step data structure. See the description of the @code{struct
|
|
__gconv_step_data} structure above for more information on how this can
|
|
be used.
|
|
|
|
The return value must be one of the following values:

@table @code
@item __GCONV_EMPTY_INPUT
All input was consumed and there is room left in the output buffer.
@item __GCONV_FULL_OUTPUT
No more room in the output buffer. In case this is not the last step,
this value is propagated down from the call of the next conversion
function in the chain.
@item __GCONV_INCOMPLETE_INPUT
The input buffer is not entirely empty since it contains an incomplete
character sequence.
@end table

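How an inner conversion loop arrives at these three results can be
sketched with a toy conversion. The code below is an assumption-laden
illustration, not the library's implementation: the enumeration
@code{demo_status} merely mirrors the three values described above, and
@code{demo_convert} is a hypothetical function that widens fixed 2-byte
units to 4-byte units without any surrogate or validity checking.

```c
#include <stddef.h>

/* Hypothetical mirror of the three return values described above.  */
enum demo_status
{
  DEMO_EMPTY_INPUT,
  DEMO_FULL_OUTPUT,
  DEMO_INCOMPLETE_INPUT
};

/* Widen big-endian 2-byte units at *IN to 4-byte units at *OUT.  Both
   pointers are advanced past what was converted.  */
static enum demo_status
demo_convert (const unsigned char **in, const unsigned char *inend,
              unsigned char **out, unsigned char *outend)
{
  for (;;)
    {
      if (*in == inend)
        return DEMO_EMPTY_INPUT;        /* Everything was consumed.  */
      if (inend - *in < 2)
        return DEMO_INCOMPLETE_INPUT;   /* Half a character is left.  */
      if (outend - *out < 4)
        return DEMO_FULL_OUTPUT;        /* No room for one more.  */

      (*out)[0] = 0;
      (*out)[1] = 0;
      (*out)[2] = (*in)[0];
      (*out)[3] = (*in)[1];
      *in += 2;
      *out += 4;
    }
}
```

The order of the checks is a design choice of this sketch: input
exhaustion is tested first, so an input that ends exactly when the
output buffer fills up is still reported as fully consumed.
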
The following example provides a framework for a conversion function.
To write a new conversion, only the holes in this implementation have
to be filled in.

@smallexample
int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
@{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  gconv_fct fct = next_step->__fct;
  int status;

  /* @r{If the function is called with no input this means we have}
     @r{to reset to the initial state. The possibly partly}
     @r{converted input is dropped.} */
  if (do_flush)
    @{
      status = __GCONV_OK;

      /* @r{Possibly emit a byte sequence which puts the state object}
         @r{into the initial state.} */

      /* @r{Call the steps down the chain if there are any but only}
         @r{if we successfully emitted the escape sequence.} */
      if (status == __GCONV_OK && ! data->__is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    @}
  else
    @{
      /* @r{We preserve the initial values of the pointer variables.} */
      const char *inptr = *inbuf;
      char *outbuf = data->__outbuf;
      char *outend = data->__outbufend;
      char *outptr;

      do
        @{
          /* @r{Remember the start value for this round.} */
          inptr = *inbuf;
          /* @r{The outbuf buffer is empty.} */
          outptr = outbuf;

          /* @r{For stateful encodings the state must be saved here.} */

          /* @r{Run the conversion loop. @code{status} is set}
             @r{appropriately afterwards.} */

          /* @r{If this is the last step leave the loop, there is}
             @r{nothing we can do.} */
          if (data->__is_last)
            @{
              /* @r{Store information about how many bytes are}
                 @r{available.} */
              data->__outbuf = outbuf;

              /* @r{If any non-reversible conversions were performed,}
                 @r{add the number to @code{*written}.} */

              break;
            @}

          /* @r{Write out all output which was produced.} */
          if (outbuf > outptr)
            @{
              const char *outerr = data->__outbuf;
              int result;

              result = fct (next_step, next_data, &outerr,
                            outbuf, written, 0);

              if (result != __GCONV_EMPTY_INPUT)
                @{
                  if (outerr != outbuf)
                    @{
                      /* @r{Reset the input buffer pointer. We}
                         @r{document here the complex case.} */
                      size_t nstatus;

                      /* @r{Reload the pointers.} */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* @r{Possibly reset the state.} */

                      /* @r{Redo the conversion, but this time}
                         @r{the end of the output buffer is at}
                         @r{@code{outerr}.} */
                    @}

                  /* @r{Change the status.} */
                  status = result;
                @}
              else
                /* @r{All the output is consumed, we can make}
                   @r{another run if everything was ok.} */
                if (status == __GCONV_FULL_OUTPUT)
                  status = __GCONV_OK;
            @}
        @}
      while (status == __GCONV_OK);

      /* @r{We finished one use of this step.} */
      ++data->__invocation_counter;
    @}

  return status;
@}
@end smallexample
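
The hole in the framework that redoes the conversion with the end of
the output buffer at @code{outerr} might be filled in along the
following lines. This is a sketch under explicit assumptions, not the
library's code: the function @code{demo_redo} is hypothetical, and so is
the toy encoding it handles, in which a byte below @code{0x80} is a
one-byte character, any other byte starts a two-byte character, and
every output character is two bytes wide.

```c
#include <stddef.h>

/* Hypothetical redo loop.  Convert from IN again, but stop as soon as
   the output would pass OUTLIMIT, the position up to which the next
   step consumed the output.  Return the corrected input pointer.  */
static const unsigned char *
demo_redo (const unsigned char *in, const unsigned char *inend,
           unsigned char *out, const unsigned char *outlimit)
{
  while (in < inend && outlimit - out >= 2)
    {
      /* Toy rule: the first byte decides the length of the sequence.  */
      size_t len = (*in < 0x80) ? 1 : 2;

      if ((size_t) (inend - in) < len)
        break;                  /* Incomplete sequence at the end.  */

      out[0] = (len == 2) ? in[0] : 0;
      out[1] = (len == 2) ? in[1] : in[0];
      in += len;
      out += 2;
    }

  return in;
}
```

Since the input is variable-width, the new input position cannot be
computed from the byte count alone, which is why the loop walks the
input again instead of using plain pointer arithmetic.
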
@end deftypevr

This information should be sufficient to write new modules. Anybody
doing so should also take a look at the source code available in the GNU
C library sources. It contains many examples of working and optimized
modules.