Index: gcc/ChangeLog
2005-03-14 Geoffrey Keating <geoffk@apple.com>

	* doc/cppopts.texi (-fexec-charset): Add concept index entry.
	(-fwide-exec-charset): Likewise.
	(-finput-charset): Likewise.
	* doc/invoke.texi (Warning Options): Document -Wnormalized=.
	* c-opts.c (c_common_handle_option): Handle -Wnormalized=.
	* c.opt (Wnormalized): New.

Index: libcpp/ChangeLog
2005-03-14 Geoffrey Keating <geoffk@apple.com>

	* init.c (cpp_create_reader): Default warn_normalize to normalized_C.
	* charset.c: Update for new format of ucnid.h.
	(ucn_valid_in_identifier): Update for new format of ucnid.h.
	Add NST parameter, and update it; update callers.
	(cpp_valid_ucn): Add NST parameter, update callers. Replace abort
	with cpp_error.
	(convert_ucn): Pass normalize_state to cpp_valid_ucn.
	* internal.h (struct normalize_state): New.
	(INITIAL_NORMALIZE_STATE): New.
	(NORMALIZE_STATE_RESULT): New.
	(NORMALIZE_STATE_UPDATE_IDNUM): New.
	(_cpp_valid_ucn): New.
	* lex.c (warn_about_normalization): New.
	(forms_identifier_p): Add normalize_state parameter, update callers.
	(lex_identifier): Add normalize_state parameter, update callers. Keep
	the state current.
	(lex_number): Likewise.
	(_cpp_lex_direct): Pass normalize_state to subroutines. Check
	it with warn_about_normalization.
	* makeucnid.c: New.
	* ucnid.h: Replace.
	* ucnid.pl: Remove.
	* ucnid.tab: Make appropriate for input to makeucnid.c. Remove
	comments about obsolete version of C++.
	* include/cpplib.h (enum cpp_normalize_level): New.
	(struct cpp_options): Add warn_normalize field.

Index: gcc/testsuite/ChangeLog
2005-03-14 Geoffrey Keating <geoffk@apple.com>

	* gcc.dg/cpp/normalize-1.c: New.
	* gcc.dg/cpp/normalize-2.c: New.
	* gcc.dg/cpp/normalize-3.c: New.
	* gcc.dg/cpp/normalize-4.c: New.
	* gcc.dg/cpp/ucnid-4.c: New.
	* gcc.dg/cpp/ucnid-5.c: New.
	* g++.dg/cpp/normalize-1.C: New.
	* g++.dg/cpp/ucnid-1.C: New.

From-SVN: r96459
This commit is contained in:
parent
cd8b38b9eb
commit
50668cf626
@ -1,3 +1,12 @@
|
||||
2005-03-14 Geoffrey Keating <geoffk@apple.com>
|
||||
|
||||
* doc/cppopts.texi (-fexec-charset): Add concept index entry.
|
||||
(-fwide-exec-charset): Likewise.
|
||||
(-finput-charset): Likewise.
|
||||
* doc/invoke.texi (Warning Options): Document -Wnormalized=.
|
||||
* c-opts.c (c_common_handle_option): Handle -Wnormalized=.
|
||||
* c.opt (Wnormalized): New.
|
||||
|
||||
2005-03-14 Devang Patel <dpatel@apple.com>
|
||||
|
||||
* doc/invoke.texi: Add reference to Visibility document.
|
||||
|
13
gcc/c-opts.c
@ -460,6 +460,19 @@ c_common_handle_option (size_t scode, const char *arg, int value)
|
||||
cpp_opts->warn_multichar = value;
|
||||
break;
|
||||
|
||||
case OPT_Wnormalized_:
|
||||
if (!value || (arg && strcasecmp (arg, "none") == 0))
|
||||
cpp_opts->warn_normalize = normalized_none;
|
||||
else if (!arg || strcasecmp (arg, "nfkc") == 0)
|
||||
cpp_opts->warn_normalize = normalized_KC;
|
||||
else if (strcasecmp (arg, "id") == 0)
|
||||
cpp_opts->warn_normalize = normalized_identifier_C;
|
||||
else if (strcasecmp (arg, "nfc") == 0)
|
||||
cpp_opts->warn_normalize = normalized_C;
|
||||
else
|
||||
error ("argument %qs to %<-Wnormalized%> not recognized", arg);
|
||||
break;
|
||||
|
||||
case OPT_Wreturn_type:
|
||||
warn_return_type = value;
|
||||
break;
|
||||
|
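The argument handling in the `OPT_Wnormalized_` case above can be sketched in isolation. This is only an illustration under stated assumptions: `enum norm_level`, `equal_nocase` and `parse_wnormalized` are hypothetical names that do not appear in the patch, and a small portable helper stands in for `strcasecmp`. As in the hunk, a zero `value` or the argument `none` disables the warning, and a missing argument selects the NFKC level.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for enum cpp_normalize_level from the patch,
   in the same order: most restrictive first.  */
enum norm_level { LEVEL_KC = 0, LEVEL_C, LEVEL_ID_C, LEVEL_NONE };

/* Case-insensitive string equality; a portable substitute for the
   strcasecmp call used in the original hunk.  */
static int
equal_nocase (const char *a, const char *b)
{
  while (*a && *b)
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++, b++;
    }
  return *a == *b;
}

/* Map the option argument to a level.  VALUE mirrors the `value'
   parameter of c_common_handle_option.  Returns 0 on success and -1
   if ARG is not recognized (the original reports an error there).  */
static int
parse_wnormalized (const char *arg, int value, enum norm_level *out)
{
  if (!value || (arg && equal_nocase (arg, "none")))
    *out = LEVEL_NONE;
  else if (!arg || equal_nocase (arg, "nfkc"))
    *out = LEVEL_KC;
  else if (equal_nocase (arg, "id"))
    *out = LEVEL_ID_C;
  else if (equal_nocase (arg, "nfc"))
    *out = LEVEL_C;
  else
    return -1;
  return 0;
}
```

Calling `parse_wnormalized ("nfc", 1, &level)` stores `LEVEL_C`, while an unknown spelling leaves `level` untouched and returns -1.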
@ -285,6 +285,10 @@ Wnonnull
|
||||
C ObjC Var(warn_nonnull)
|
||||
Warn about NULL being passed to argument slots marked as requiring non-NULL
|
||||
|
||||
Wnormalized=
|
||||
C ObjC C++ ObjC++ Joined
|
||||
-Wnormalized=<none|id|nfc|nfkc> Warn about non-normalized Unicode strings
|
||||
|
||||
Wold-style-cast
|
||||
C++ ObjC++ Var(warn_old_style_cast)
|
||||
Warn if a C-style cast is used in a program
|
||||
|
@ -530,12 +530,14 @@ ignored. The default is 8.
|
||||
|
||||
@item -fexec-charset=@var{charset}
|
||||
@opindex fexec-charset
|
||||
@cindex character set, execution
|
||||
Set the execution character set, used for string and character
|
||||
constants. The default is UTF-8. @var{charset} can be any encoding
|
||||
supported by the system's @code{iconv} library routine.
|
||||
|
||||
@item -fwide-exec-charset=@var{charset}
|
||||
@opindex fwide-exec-charset
|
||||
@cindex character set, wide execution
|
||||
Set the wide execution character set, used for wide string and
|
||||
character constants. The default is UTF-32 or UTF-16, whichever
|
||||
corresponds to the width of @code{wchar_t}. As with
|
||||
@ -545,6 +547,7 @@ problems with encodings that do not fit exactly in @code{wchar_t}.
|
||||
|
||||
@item -finput-charset=@var{charset}
|
||||
@opindex finput-charset
|
||||
@cindex character set, input
|
||||
Set the input character set, used for translation from the character
|
||||
set of the input file to the source character set used by GCC@. If the
|
||||
locale does not specify, or GCC cannot get this information from the
|
||||
|
@ -3039,6 +3039,51 @@ Do not warn if a multicharacter constant (@samp{'FOOF'}) is used.
|
||||
Usually they indicate a typo in the user's code, as they have
|
||||
implementation-defined values, and should not be used in portable code.
|
||||
|
||||
@item -Wnormalized=<none|id|nfc|nfkc>
@opindex Wnormalized
@cindex NFC
@cindex NFKC
@cindex character set, input normalization
In ISO C and ISO C++, two identifiers are different if they are
different sequences of characters. However, sometimes when characters
outside the basic ASCII character set are used, you can have two
different character sequences that look the same. To avoid confusion,
the ISO 10646 standard sets out some @dfn{normalization rules} which,
when applied, ensure that two sequences that look the same are turned into
the same sequence. GCC can warn you if you are using identifiers which
have not been normalized; this option controls that warning.

There are four levels of warning that GCC supports. The default is
@option{-Wnormalized=nfc}, which warns about any identifier which is
not in the ISO 10646 ``C'' normalized form, @dfn{NFC}. NFC is the
recommended form for most uses.

Unfortunately, there are some characters which ISO C and ISO C++ allow
in identifiers that, when turned into NFC, aren't allowable as
identifiers. That is, there's no way to use these symbols in portable
ISO C or C++ and have all your identifiers in NFC.
@option{-Wnormalized=id} suppresses the warning for these characters.
It is hoped that future versions of the standards involved will correct
this, which is why this option is not the default.

You can switch the warning off for all characters by writing
@option{-Wnormalized=none}. You would only want to do this if you
were using some other normalization scheme (like ``D''), because
otherwise you can easily create bugs that are literally impossible to see.

Some characters in ISO 10646 have distinct meanings but look identical
in some fonts or display methodologies, especially once formatting has
been applied. For instance, @code{\u207F}, ``SUPERSCRIPT LATIN SMALL
LETTER N'', will display just like a regular @code{n} which has been
placed in a superscript. ISO 10646 defines the @dfn{NFKC}
normalization scheme to convert all these into a standard form as
well, and GCC will warn if your code is not in NFKC if you use
@option{-Wnormalized=nfkc}. This warning is comparable to warning
about every identifier that contains the letter O because it might be
confused with the digit 0, and so is not the default, but may be
useful as a local coding convention if the programming environment
cannot be fixed to display these characters distinctly.
|
||||
|
||||
@item -Wno-deprecated-declarations
|
||||
@opindex Wno-deprecated-declarations
|
||||
Do not warn about uses of functions, variables, and types marked as
|
||||
|
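The "look the same, are different" problem described in the `-Wnormalized` documentation can be made concrete with a small sketch. It is not part of the patch: the names `precomposed`, `decomposed` and `same_sequence` are hypothetical, and the bytes are written out explicitly as UTF-8 so the example does not depend on any character-set option. Both strings render as "é", yet only the first (the single code point U+00E9) is in NFC; the second is `e` followed by U+0301 COMBINING ACUTE ACCENT.

```c
#include <string.h>

/* U+00E9 LATIN SMALL LETTER E WITH ACUTE, as UTF-8 bytes (NFC form).  */
static const char precomposed[] = "\xC3\xA9";

/* 'e' followed by U+0301 COMBINING ACUTE ACCENT, as UTF-8 bytes.
   It displays the same but is a different character sequence.  */
static const char decomposed[] = "e\xCC\x81";

/* Two spellings name the same identifier only if they are the same
   sequence of characters; a byte comparison models that rule here.  */
static int
same_sequence (const char *a, const char *b)
{
  return strcmp (a, b) == 0;
}
```

Because `same_sequence (precomposed, decomposed)` is false, two identifiers spelled these two ways would be distinct names that a reader cannot tell apart, which is the situation the warning is meant to catch.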
@ -1,3 +1,14 @@
|
||||
2005-03-14 Geoffrey Keating <geoffk@apple.com>
|
||||
|
||||
* gcc.dg/cpp/normalize-1.c: New.
|
||||
* gcc.dg/cpp/normalize-2.c: New.
|
||||
* gcc.dg/cpp/normalize-3.c: New.
|
||||
* gcc.dg/cpp/normalize-4.c: New.
|
||||
* gcc.dg/cpp/ucnid-4.c: New.
|
||||
* gcc.dg/cpp/ucnid-5.c: New.
|
||||
* g++.dg/cpp/normalize-1.C: New.
|
||||
* g++.dg/cpp/ucnid-1.C: New.
|
||||
|
||||
2005-03-14 Alexandre Oliva <aoliva@redhat.com>
|
||||
|
||||
* gcc.dg/pr18628.c: New.
|
||||
|
34
gcc/testsuite/g++.dg/cpp/normalize-1.C
Normal file
@ -0,0 +1,34 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-Wnormalized=id" } */
|
||||
|
||||
\u00AA
|
||||
\u00B7
|
||||
\u0F43 /* { dg-warning "not in NFC" } */
|
||||
a\u05B8\u05B9\u05B9\u05BBb
|
||||
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
|
||||
\u09CB
|
||||
\u09C7\u09BE /* { dg-warning "not in NFC" } */
|
||||
\u0B4B
|
||||
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
|
||||
\u0BCA
|
||||
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0BCB
|
||||
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0CCA
|
||||
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
|
||||
\u0D4A
|
||||
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
|
||||
\u0D4B
|
||||
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
|
||||
|
||||
K
|
||||
\u212A
|
||||
|
||||
\u03AC
|
||||
\u1F71 /* { dg-warning "not in NFC" } */
|
||||
|
||||
\uAC00
|
||||
\u1100\u1161
|
||||
\uAC01
|
||||
\u1100\u1161\u11A8
|
||||
\uAC00\u11A8
|
17
gcc/testsuite/g++.dg/cpp/ucnid-1.C
Normal file
@ -0,0 +1,17 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-pedantic" } */
|
||||
|
||||
\u00AA /* { dg-error "not valid in an identifier" } */
|
||||
\u00AB /* { dg-error "not valid in an identifier" } */
|
||||
\u00B6 /* { dg-error "not valid in an identifier" } */
|
||||
\u00BA /* { dg-error "not valid in an identifier" } */
|
||||
\u00C0
|
||||
\u00D6
|
||||
\u0384
|
||||
|
||||
\u0669 /* { dg-error "not valid in an identifier" } */
|
||||
A\u0669 /* { dg-error "not valid in an identifier" } */
|
||||
0\u00BA /* { dg-error "not valid in an identifier" } */
|
||||
0\u0669 /* { dg-error "not valid in an identifier" } */
|
||||
\u0E59
|
||||
A\u0E59
|
34
gcc/testsuite/gcc.dg/cpp/normalize-1.c
Normal file
@ -0,0 +1,34 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-std=c99" } */
|
||||
|
||||
\u00AA
|
||||
\u00B7
|
||||
\u0F43 /* { dg-warning "not in NFC" } */
|
||||
a\u05B8\u05B9\u05B9\u05BBb
|
||||
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
|
||||
\u09CB
|
||||
\u09C7\u09BE /* { dg-warning "not in NFC" } */
|
||||
\u0B4B
|
||||
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
|
||||
\u0BCA
|
||||
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0BCB
|
||||
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0CCA
|
||||
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
|
||||
\u0D4A
|
||||
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
|
||||
\u0D4B
|
||||
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
|
||||
|
||||
K
|
||||
\u212A /* { dg-warning "not in NFC" } */
|
||||
|
||||
\u03AC
|
||||
\u1F71 /* { dg-warning "not in NFC" } */
|
||||
|
||||
\uAC00
|
||||
\u1100\u1161 /* { dg-warning "not in NFC" } */
|
||||
\uAC01
|
||||
\u1100\u1161\u11A8 /* { dg-warning "not in NFC" } */
|
||||
\uAC00\u11A8 /* { dg-warning "not in NFC" } */
|
34
gcc/testsuite/gcc.dg/cpp/normalize-2.c
Normal file
@ -0,0 +1,34 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-std=c99 -Wnormalized=nfkc" } */
|
||||
|
||||
\u00AA /* { dg-warning "not in NFKC" } */
|
||||
\u00B7
|
||||
\u0F43 /* { dg-warning "not in NFC" } */
|
||||
a\u05B8\u05B9\u05B9\u05BBb
|
||||
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
|
||||
\u09CB
|
||||
\u09C7\u09BE /* { dg-warning "not in NFC" } */
|
||||
\u0B4B
|
||||
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
|
||||
\u0BCA
|
||||
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0BCB
|
||||
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0CCA
|
||||
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
|
||||
\u0D4A
|
||||
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
|
||||
\u0D4B
|
||||
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
|
||||
|
||||
K
|
||||
\u212A /* { dg-warning "not in NFC" } */
|
||||
|
||||
\u03AC
|
||||
\u1F71 /* { dg-warning "not in NFC" } */
|
||||
|
||||
\uAC00
|
||||
\u1100\u1161 /* { dg-warning "not in NFC" } */
|
||||
\uAC01
|
||||
\u1100\u1161\u11A8 /* { dg-warning "not in NFC" } */
|
||||
\uAC00\u11A8 /* { dg-warning "not in NFC" } */
|
34
gcc/testsuite/gcc.dg/cpp/normalize-3.c
Normal file
@ -0,0 +1,34 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-std=c99 -Wnormalized=id" } */
|
||||
|
||||
\u00AA
|
||||
\u00B7
|
||||
\u0F43 /* { dg-warning "not in NFC" } */
|
||||
a\u05B8\u05B9\u05B9\u05BBb
|
||||
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
|
||||
\u09CB
|
||||
\u09C7\u09BE /* { dg-warning "not in NFC" } */
|
||||
\u0B4B
|
||||
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
|
||||
\u0BCA
|
||||
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0BCB
|
||||
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
|
||||
\u0CCA
|
||||
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
|
||||
\u0D4A
|
||||
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
|
||||
\u0D4B
|
||||
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
|
||||
|
||||
K
|
||||
\u212A
|
||||
|
||||
\u03AC
|
||||
\u1F71 /* { dg-warning "not in NFC" } */
|
||||
|
||||
\uAC00
|
||||
\u1100\u1161
|
||||
\uAC01
|
||||
\u1100\u1161\u11A8
|
||||
\uAC00\u11A8
|
34
gcc/testsuite/gcc.dg/cpp/normalize-4.c
Normal file
@ -0,0 +1,34 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-std=c99 -Wnormalized=none" } */
|
||||
|
||||
\u00AA
|
||||
\u00B7
|
||||
\u0F43
|
||||
a\u05B8\u05B9\u05B9\u05BBb
|
||||
a\u05BB\u05B9\u05B8\u05B9b
|
||||
\u09CB
|
||||
\u09C7\u09BE
|
||||
\u0B4B
|
||||
\u0B47\u0B3E
|
||||
\u0BCA
|
||||
\u0BC6\u0BBE
|
||||
\u0BCB
|
||||
\u0BC7\u0BBE
|
||||
\u0CCA
|
||||
\u0CC6\u0CC2
|
||||
\u0D4A
|
||||
\u0D46\u0D3E
|
||||
\u0D4B
|
||||
\u0D47\u0D3E
|
||||
|
||||
K
|
||||
\u212A
|
||||
|
||||
\u03AC
|
||||
\u1F71
|
||||
|
||||
\uAC00
|
||||
\u1100\u1161
|
||||
\uAC01
|
||||
\u1100\u1161\u11A8
|
||||
\uAC00\u11A8
|
17
gcc/testsuite/gcc.dg/cpp/ucnid-4.c
Normal file
@ -0,0 +1,17 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-std=c99" } */
|
||||
|
||||
\u00AA
|
||||
\u00AB /* { dg-error "not valid in an identifier" } */
|
||||
\u00B6 /* { dg-error "not valid in an identifier" } */
|
||||
\u00BA
|
||||
\u00C0
|
||||
\u00D6
|
||||
\u0384
|
||||
|
||||
\u0669 /* { dg-error "not valid at the start of an identifier" } */
|
||||
A\u0669
|
||||
0\u00BA
|
||||
0\u0669
|
||||
\u0E59 /* { dg-error "not valid at the start of an identifier" } */
|
||||
A\u0E59
|
17
gcc/testsuite/gcc.dg/cpp/ucnid-5.c
Normal file
@ -0,0 +1,17 @@
|
||||
/* { dg-do preprocess } */
|
||||
/* { dg-options "-std=c99 -pedantic" } */
|
||||
|
||||
\u00AA
|
||||
\u00AB /* { dg-error "not valid in an identifier" } */
|
||||
\u00B6 /* { dg-error "not valid in an identifier" } */
|
||||
\u00BA
|
||||
\u00C0
|
||||
\u00D6
|
||||
\u0384 /* { dg-error "not valid in an identifier" } */
|
||||
|
||||
\u0669 /* { dg-error "not valid at the start of an identifier" } */
|
||||
A\u0669
|
||||
0\u00BA
|
||||
0\u0669
|
||||
\u0E59 /* { dg-error "not valid at the start of an identifier" } */
|
||||
A\u0E59
|
@ -1,3 +1,32 @@
|
||||
2005-03-14 Geoffrey Keating <geoffk@apple.com>
|
||||
|
||||
* init.c (cpp_create_reader): Default warn_normalize to normalized_C.
|
||||
* charset.c: Update for new format of ucnid.h.
|
||||
(ucn_valid_in_identifier): Update for new format of ucnid.h.
|
||||
Add NST parameter, and update it; update callers.
|
||||
(cpp_valid_ucn): Add NST parameter, update callers. Replace abort
|
||||
with cpp_error.
|
||||
(convert_ucn): Pass normalize_state to cpp_valid_ucn.
|
||||
* internal.h (struct normalize_state): New.
|
||||
(INITIAL_NORMALIZE_STATE): New.
|
||||
(NORMALIZE_STATE_RESULT): New.
|
||||
(NORMALIZE_STATE_UPDATE_IDNUM): New.
|
||||
(_cpp_valid_ucn): New.
|
||||
* lex.c (warn_about_normalization): New.
|
||||
(forms_identifier_p): Add normalize_state parameter, update callers.
|
||||
(lex_identifier): Add normalize_state parameter, update callers. Keep
|
||||
the state current.
|
||||
(lex_number): Likewise.
|
||||
(_cpp_lex_direct): Pass normalize_state to subroutines. Check
|
||||
it with warn_about_normalization.
|
||||
* makeucnid.c: New.
|
||||
* ucnid.h: Replace.
|
||||
* ucnid.pl: Remove.
|
||||
* ucnid.tab: Make appropriate for input to makeucnid.c. Remove
|
||||
comments about obsolete version of C++.
|
||||
* include/cpplib.h (enum cpp_normalize_level): New.
|
||||
(struct cpp_options): Add warn_normalize field.
|
||||
|
||||
2005-03-11 Geoffrey Keating <geoffk@apple.com>
|
||||
|
||||
* directives.c (glue_header_name): Update call to cpp_spell_token.
|
||||
|
142
libcpp/charset.c
@ -22,7 +22,6 @@ Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */
|
||||
#include "system.h"
|
||||
#include "cpplib.h"
|
||||
#include "internal.h"
|
||||
#include "ucnid.h"
|
||||
|
||||
/* Character set handling for C-family languages.
|
||||
|
||||
@ -786,43 +785,128 @@ width_to_mask (size_t width)
|
||||
return ((size_t) 1 << width) - 1;
|
||||
}
|
||||
|
||||
/* A large table of Unicode character information. */
|
||||
enum {
|
||||
/* Valid in a C99 identifier? */
|
||||
C99 = 1,
|
||||
/* Valid in a C99 identifier, but not as the first character? */
|
||||
DIG = 2,
|
||||
/* Valid in a C++ identifier? */
|
||||
CXX = 4,
|
||||
/* NFC representation is not valid in an identifier? */
|
||||
CID = 8,
|
||||
/* Might be valid NFC form? */
|
||||
NFC = 16,
|
||||
/* Might be valid NFKC form? */
|
||||
NKC = 32,
|
||||
/* Certain preceding characters might make it not valid NFC/NFKC form? */
|
||||
CTX = 64
|
||||
};
|
||||
|
||||
static const struct {
|
||||
/* Bitmap of flags above. */
|
||||
unsigned char flags;
|
||||
/* Combining class of the character. */
|
||||
unsigned char combine;
|
||||
/* Last character in the range described by this entry. */
|
||||
unsigned short end;
|
||||
} ucnranges[] = {
|
||||
#include "ucnid.h"
|
||||
};
|
||||
|
||||
/* Returns 1 if C is valid in an identifier, 2 if C is valid except at
|
||||
the start of an identifier, and 0 if C is not valid in an
|
||||
identifier. We assume C has already gone through the checks of
|
||||
_cpp_valid_ucn. The algorithm is a simple binary search on the
|
||||
table defined in cppucnid.h. */
|
||||
_cpp_valid_ucn. Also update NST for C if returning nonzero. The
|
||||
algorithm is a simple binary search on the table defined in
|
||||
ucnid.h. */
|
||||
|
||||
static int
|
||||
ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c)
|
||||
ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
|
||||
struct normalize_state *nst)
|
||||
{
|
||||
int mn, mx, md;
|
||||
|
||||
mn = -1;
|
||||
mx = ARRAY_SIZE (ucnranges);
|
||||
while (mx - mn > 1)
|
||||
{
|
||||
md = (mn + mx) / 2;
|
||||
if (c < ucnranges[md].lo)
|
||||
mx = md;
|
||||
else if (c > ucnranges[md].hi)
|
||||
mn = md;
|
||||
else
|
||||
goto found;
|
||||
}
|
||||
if (c > 0xFFFF)
|
||||
return 0;
|
||||
|
||||
found:
|
||||
mn = 0;
|
||||
mx = ARRAY_SIZE (ucnranges) - 1;
|
||||
while (mx != mn)
|
||||
{
|
||||
md = (mn + mx) / 2;
|
||||
if (c <= ucnranges[md].end)
|
||||
mx = md;
|
||||
else
|
||||
mn = md + 1;
|
||||
}
|
||||
|
||||
/* When -pedantic, we require the character to have been listed by
|
||||
the standard for the current language. Otherwise, we accept the
|
||||
union of the acceptable sets for C++98 and C99. */
|
||||
if (CPP_PEDANTIC (pfile)
|
||||
&& ((CPP_OPTION (pfile, c99) && !(ucnranges[md].flags & C99))
|
||||
|| (CPP_OPTION (pfile, cplusplus)
|
||||
&& !(ucnranges[md].flags & CXX))))
|
||||
if (! (ucnranges[mn].flags & (C99 | CXX)))
|
||||
return 0;
|
||||
|
||||
if (CPP_PEDANTIC (pfile)
|
||||
&& ((CPP_OPTION (pfile, c99) && !(ucnranges[mn].flags & C99))
|
||||
|| (CPP_OPTION (pfile, cplusplus)
|
||||
&& !(ucnranges[mn].flags & CXX))))
|
||||
return 0;
|
||||
|
||||
/* Update NST. */
|
||||
if (ucnranges[mn].combine != 0 && ucnranges[mn].combine < nst->prev_class)
|
||||
nst->level = normalized_none;
|
||||
else if (ucnranges[mn].flags & CTX)
|
||||
{
|
||||
bool safe;
|
||||
cppchar_t p = nst->previous;
|
||||
|
||||
/* Easy cases from Bengali, Oriya, Tamil, Kannada, and Malayalam. */
|
||||
if (c == 0x09BE)
|
||||
safe = p != 0x09C7; /* Use 09CB instead of 09C7 09BE. */
|
||||
else if (c == 0x0B3E)
|
||||
safe = p != 0x0B47; /* Use 0B4B instead of 0B47 0B3E. */
|
||||
else if (c == 0x0BBE)
|
||||
safe = p != 0x0BC6 && p != 0x0BC7; /* Use 0BCA/0BCB instead. */
|
||||
else if (c == 0x0CC2)
|
||||
safe = p != 0x0CC6; /* Use 0CCA instead of 0CC6 0CC2. */
|
||||
else if (c == 0x0D3E)
|
||||
safe = p != 0x0D46 && p != 0x0D47; /* Use 0D4A/0D4B instead. */
|
||||
/* For Hangul, characters in the range AC00-D7A3 are NFC/NFKC,
|
||||
and are combined algorithmically from a sequence of the form
|
||||
1100-1112 1161-1175 11A8-11C2
|
||||
(if the third is not present, it is treated as 11A7, which is not
|
||||
really a valid character).
|
||||
Unfortunately, C99 allows (only) the NFC form, but C++ allows
|
||||
only the combining characters. */
|
||||
else if (c >= 0x1161 && c <= 0x1175)
|
||||
safe = p < 0x1100 || p > 0x1112;
|
||||
else if (c >= 0x11A8 && c <= 0x11C2)
|
||||
safe = (p < 0xAC00 || p > 0xD7A3 || (p - 0xAC00) % 28 != 0);
|
||||
else
|
||||
{
|
||||
/* Uh-oh, someone updated ucnid.h without updating this code. */
|
||||
cpp_error (pfile, CPP_DL_ICE, "Character %x might not be NFKC", c);
|
||||
safe = true;
|
||||
}
|
||||
if (!safe && c < 0x1161)
|
||||
nst->level = normalized_none;
|
||||
else if (!safe)
|
||||
nst->level = MAX (nst->level, normalized_identifier_C);
|
||||
}
|
||||
else if (ucnranges[mn].flags & NKC)
|
||||
;
|
||||
else if (ucnranges[mn].flags & NFC)
|
||||
nst->level = MAX (nst->level, normalized_C);
|
||||
else if (ucnranges[mn].flags & CID)
|
||||
nst->level = MAX (nst->level, normalized_identifier_C);
|
||||
else
|
||||
nst->level = normalized_none;
|
||||
nst->previous = c;
|
||||
nst->prev_class = ucnranges[mn].combine;
|
||||
|
||||
/* In C99, UCN digits may not begin identifiers. */
|
||||
if (CPP_OPTION (pfile, c99) && (ucnranges[md].flags & DIG))
|
||||
if (CPP_OPTION (pfile, c99) && (ucnranges[mn].flags & DIG))
|
||||
return 2;
|
||||
|
||||
return 1;
|
||||
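The new lookup in `ucn_valid_in_identifier` stores only the last code point of each range and searches for the first entry whose `end` is not below the character. A minimal sketch of that search follows, assuming a tiny made-up table: `struct range`, `ranges` and `find_range` are hypothetical names, and the three entries are illustrative only, not data from ucnid.h. Like the original, it relies on the caller having rejected code points above 0xFFFF and on the last entry ending at 0xFFFF.

```c
#include <stddef.h>

/* One table entry: the flags apply to every code point after the
   previous entry's END, up to and including this entry's END.  */
struct range
{
  unsigned char flags;
  unsigned short end;
};

/* Illustrative table only.  Flag value 1 marks "valid in an
   identifier" for the single code point U+00AA.  */
static const struct range ranges[] = {
  { 0, 0x00A9 },
  { 1, 0x00AA },
  { 0, 0xFFFF },
};

/* Return the index of the entry covering C, for C <= 0xFFFF.  The loop
   narrows [mn, mx] until both name the first entry with end >= C.  */
static size_t
find_range (unsigned int c)
{
  size_t mn = 0;
  size_t mx = sizeof ranges / sizeof ranges[0] - 1;

  while (mx != mn)
    {
      size_t md = (mn + mx) / 2;
      if (c <= ranges[md].end)
        mx = md;
      else
        mn = md + 1;
    }
  return mn;
}
```

Since every code point up to 0xFFFF falls in exactly one entry, the search cannot miss, which is why the rewritten function no longer needs the `goto found` path of the older lo/hi table.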
@ -853,7 +937,8 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c)
|
||||
|
||||
cppchar_t
|
||||
_cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
|
||||
const uchar *limit, int identifier_pos)
|
||||
const uchar *limit, int identifier_pos,
|
||||
struct normalize_state *nst)
|
||||
{
|
||||
cppchar_t result, c;
|
||||
unsigned int length;
|
||||
@ -873,7 +958,10 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
|
||||
else if (str[-1] == 'U')
|
||||
length = 8;
|
||||
else
|
||||
abort();
|
||||
{
|
||||
cpp_error (pfile, CPP_DL_ICE, "In _cpp_valid_ucn but not a UCN");
|
||||
length = 4;
|
||||
}
|
||||
|
||||
result = 0;
|
||||
do
|
||||
@ -915,10 +1003,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
|
||||
CPP_OPTION (pfile, warn_dollars) = 0;
|
||||
cpp_error (pfile, CPP_DL_PEDWARN, "'$' in identifier or number");
|
||||
}
|
||||
NORMALIZE_STATE_UPDATE_IDNUM (nst);
|
||||
}
|
||||
else if (identifier_pos)
|
||||
{
|
||||
int validity = ucn_valid_in_identifier (pfile, result);
|
||||
int validity = ucn_valid_in_identifier (pfile, result, nst);
|
||||
|
||||
if (validity == 0)
|
||||
cpp_error (pfile, CPP_DL_ERROR,
|
||||
@ -950,9 +1039,10 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
|
||||
int rval;
|
||||
struct cset_converter cvt
|
||||
= wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
|
||||
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
|
||||
|
||||
from++; /* Skip u/U. */
|
||||
ucn = _cpp_valid_ucn (pfile, &from, limit, 0);
|
||||
ucn = _cpp_valid_ucn (pfile, &from, limit, 0, &nst);
|
||||
|
||||
rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
|
||||
if (rval)
|
||||
|
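Two of the checks in the "Update NST" block of `ucn_valid_in_identifier` can be read as small predicates. The sketch below restates them under stated assumptions: the function names are hypothetical and are not part of libcpp, and the range constants are copied from the hunk. The first predicate is the combining-class ordering test (a nonzero class lower than the previous one means the sequence is not normalized); the other two are the Hangul cases, where a vowel after a leading consonant, or a trailing consonant after an LV syllable, could have been written as a single precomposed syllable.

```c
#include <stdbool.h>

/* True if a character of combining class CLS may follow one of class
   PREV_CLS.  Class 0 (a starter) is always acceptable.  */
static bool
ordering_ok (unsigned char prev_cls, unsigned char cls)
{
  return cls == 0 || cls >= prev_cls;
}

/* True if a Hangul vowel (U+1161..U+1175) after P is already in
   composed form, i.e. P is not a leading consonant U+1100..U+1112.  */
static bool
vowel_jamo_is_safe (unsigned int p)
{
  return p < 0x1100 || p > 0x1112;
}

/* True if a Hangul trailing consonant (U+11A8..U+11C2) after P is
   already in composed form, i.e. P is not an LV syllable.  LV
   syllables are the ones at offsets divisible by 28 from U+AC00.  */
static bool
trailing_jamo_is_safe (unsigned int p)
{
  return p < 0xAC00 || p > 0xD7A3 || (p - 0xAC00) % 28 != 0;
}
```

This matches the new tests: in `a\u05B8\u05B9\u05B9\u05BBb` the combining classes never decrease, while `a\u05BB\u05B9\u05B8\u05B9b` has a lower class following a higher one; `\uAC00\u11A8` pairs an LV syllable with a trailing consonant and so is not the composed form `\uAC01`.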
@ -236,6 +236,19 @@ typedef CPPCHAR_SIGNED_T cppchar_signed_t;
|
||||
/* Style of header dependencies to generate. */
|
||||
enum cpp_deps_style { DEPS_NONE = 0, DEPS_USER, DEPS_SYSTEM };
|
||||
|
||||
/* The possible normalization levels, from most restrictive to least. */
|
||||
enum cpp_normalize_level {
|
||||
/* In NFKC. */
|
||||
normalized_KC = 0,
|
||||
/* In NFC. */
|
||||
normalized_C,
|
||||
/* In NFC, except for subsequences where being in NFC would make
|
||||
the identifier invalid. */
|
||||
normalized_identifier_C,
|
||||
/* Not normalized at all. */
|
||||
normalized_none
|
||||
};
|
||||
|
||||
/* This structure is nested inside struct cpp_reader, and
|
||||
carries all the options visible to the command line. */
|
||||
struct cpp_options
|
||||
@ -373,6 +386,10 @@ struct cpp_options
|
||||
/* Holds the name of the input character set. */
|
||||
const char *input_charset;
|
||||
|
||||
/* The minimum permitted level of normalization before a warning
|
||||
is generated. */
|
||||
enum cpp_normalize_level warn_normalize;
|
||||
|
||||
/* True to warn about precompiled header files we couldn't use. */
|
||||
bool warn_invalid_pch;
|
||||
|
||||
|
@ -153,6 +153,7 @@ cpp_create_reader (enum c_lang lang, hash_table *table,
|
||||
CPP_OPTION (pfile, dollars_in_ident) = 1;
|
||||
CPP_OPTION (pfile, warn_dollars) = 1;
|
||||
CPP_OPTION (pfile, warn_variadic_macros) = 1;
|
||||
CPP_OPTION (pfile, warn_normalize) = normalized_C;
|
||||
|
||||
/* Default CPP arithmetic to something sensible for the host for the
|
||||
benefit of dumb users like fix-header. */
|
||||
|
@ -564,8 +564,31 @@ extern unsigned char *_cpp_copy_replacement_text (const cpp_macro *,
|
||||
extern size_t _cpp_replacement_text_len (const cpp_macro *);
|
||||
|
||||
/* In charset.c. */
|
||||
|
||||
/* The normalization state at this point in the sequence.
|
||||
It starts initialized to all zeros, and at the end
|
||||
'level' is the normalization level of the sequence. */
|
||||
|
||||
struct normalize_state
|
||||
{
|
||||
/* The previous character. */
|
||||
cppchar_t previous;
|
||||
/* The combining class of the previous character. */
|
||||
unsigned char prev_class;
|
||||
/* The lowest normalization level so far. */
|
||||
enum cpp_normalize_level level;
|
||||
};
|
||||
#define INITIAL_NORMALIZE_STATE { 0, 0, normalized_KC }
|
||||
#define NORMALIZE_STATE_RESULT(st) ((st)->level)
|
||||
|
||||
/* We saw a character that matches ISIDNUM(), update a
|
||||
normalize_state appropriately. */
|
||||
#define NORMALIZE_STATE_UPDATE_IDNUM(st) \
|
||||
((st)->previous = 0, (st)->prev_class = 0)
|
||||
|
||||
extern cppchar_t _cpp_valid_ucn (cpp_reader *, const unsigned char **,
|
||||
const unsigned char *, int);
|
||||
const unsigned char *, int,
|
||||
struct normalize_state *state);
|
||||
extern void _cpp_destroy_iconv (cpp_reader *);
|
||||
extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
|
||||
unsigned char *, size_t, size_t,
|
||||
|
75
libcpp/lex.c
@ -53,9 +53,6 @@ static const struct token_spelling token_spellings[N_TTYPES] = { TTYPE_TABLE };
|
||||
static void add_line_note (cpp_buffer *, const uchar *, unsigned int);
|
||||
static int skip_line_comment (cpp_reader *);
|
||||
static void skip_whitespace (cpp_reader *, cppchar_t);
|
||||
static cpp_hashnode *lex_identifier (cpp_reader *, const uchar *, bool);
|
||||
static void lex_number (cpp_reader *, cpp_string *);
|
||||
static bool forms_identifier_p (cpp_reader *, int);
|
||||
static void lex_string (cpp_reader *, cpp_token *, const uchar *);
|
||||
static void save_comment (cpp_reader *, cpp_token *, const uchar *, cppchar_t);
|
||||
static void create_literal (cpp_reader *, cpp_token *, const uchar *,
|
||||
@ -430,10 +427,36 @@ name_p (cpp_reader *pfile, const cpp_string *string)
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* After parsing an identifier or other sequence, produce a warning about
|
||||
sequences not in NFC/NFKC. */
|
||||
static void
|
||||
warn_about_normalization (cpp_reader *pfile,
|
||||
const cpp_token *token,
|
||||
const struct normalize_state *s)
|
||||
{
|
||||
if (CPP_OPTION (pfile, warn_normalize) < NORMALIZE_STATE_RESULT (s)
|
||||
&& !pfile->state.skipping)
|
||||
{
|
||||
/* Make sure that the token is printed using UCNs, even
|
||||
if we'd otherwise happily print UTF-8. */
|
||||
unsigned char *buf = xmalloc (cpp_token_len (token));
|
||||
size_t sz;
|
||||
|
||||
sz = cpp_spell_token (pfile, token, buf, false) - buf;
|
||||
if (NORMALIZE_STATE_RESULT (s) == normalized_C)
cpp_error_with_line (pfile, CPP_DL_WARNING, token->src_loc, 0,
"`%.*s' is not in NFKC", (int) sz, buf);
else
cpp_error_with_line (pfile, CPP_DL_WARNING, token->src_loc, 0,
"`%.*s' is not in NFC", (int) sz, buf);
/* The spelling buffer is only needed for the diagnostic. */
free (buf);
}
|
||||
}
|
||||
|
||||
/* Returns TRUE if the sequence starting at buffer->cur is valid in
|
||||
an identifier. FIRST is TRUE if this starts an identifier. */
|
||||
static bool
|
||||
forms_identifier_p (cpp_reader *pfile, int first)
|
||||
forms_identifier_p (cpp_reader *pfile, int first,
|
||||
struct normalize_state *state)
|
||||
{
|
||||
cpp_buffer *buffer = pfile->buffer;
|
||||
|
||||
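The test in `warn_about_normalization` depends on the ordering of `enum cpp_normalize_level`: levels are listed from most restrictive (NFKC) to least (none), the state keeps the worst level seen across the token, and a warning is due when the requested level is stricter than what the token achieved. A minimal sketch of that logic follows; the names `enum norm_level`, `worst_level` and `should_warn` are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>

/* Hypothetical mirror of enum cpp_normalize_level, same ordering.  */
enum norm_level { LEVEL_KC = 0, LEVEL_C, LEVEL_ID_C, LEVEL_NONE };

/* Fold one character's level into the running level of a token.  A
   token is only as normalized as its least normalized character,
   which is what the MAX (nst->level, ...) updates express.  */
static enum norm_level
worst_level (enum norm_level so_far, enum norm_level ch)
{
  return ch > so_far ? ch : so_far;
}

/* Warn when the token's level is less normalized than REQUESTED.  */
static bool
should_warn (enum norm_level requested, enum norm_level token_level)
{
  return requested < token_level;
}
```

With the default of `LEVEL_C`, a token that is in NFC but not NFKC produces no warning, whereas requesting `LEVEL_KC` does warn about it; requesting `LEVEL_NONE` can never warn because no level is greater.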
@ -457,7 +480,8 @@ forms_identifier_p (cpp_reader *pfile, int first)
|
||||
&& (buffer->cur[1] == 'u' || buffer->cur[1] == 'U'))
|
||||
{
|
||||
buffer->cur += 2;
|
||||
if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first))
|
||||
if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
|
||||
state))
|
||||
return true;
|
||||
buffer->cur -= 2;
|
||||
}
|
||||
@ -467,7 +491,8 @@ forms_identifier_p (cpp_reader *pfile, int first)
|
||||
|
||||
/* Lex an identifier starting at BUFFER->CUR - 1. */
|
||||
static cpp_hashnode *
|
||||
lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
|
||||
lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
|
||||
struct normalize_state *nst)
|
||||
{
|
||||
cpp_hashnode *result;
|
||||
const uchar *cur;
|
||||
@ -482,13 +507,16 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
|
||||
cur++;
|
||||
}
|
||||
pfile->buffer->cur = cur;
|
||||
if (starts_ucn || forms_identifier_p (pfile, false))
|
||||
if (starts_ucn || forms_identifier_p (pfile, false, nst))
|
||||
{
|
||||
/* Slower version for identifiers containing UCNs (or $). */
|
||||
do {
|
||||
while (ISIDNUM (*pfile->buffer->cur))
|
||||
{
|
||||
pfile->buffer->cur++;
|
||||
} while (forms_identifier_p (pfile, false));
|
||||
NORMALIZE_STATE_UPDATE_IDNUM (nst);
|
||||
}
|
||||
} while (forms_identifier_p (pfile, false, nst));
|
||||
result = _cpp_interpret_identifier (pfile, base,
|
||||
pfile->buffer->cur - base);
|
||||
}
|
||||
@ -524,7 +552,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
|
||||
|
||||
/* Lex a number to NUMBER starting at BUFFER->CUR - 1. */
|
||||
static void
|
||||
lex_number (cpp_reader *pfile, cpp_string *number)
|
||||
lex_number (cpp_reader *pfile, cpp_string *number,
|
||||
struct normalize_state *nst)
|
||||
{
|
||||
const uchar *cur;
|
||||
const uchar *base;
|
||||
@ -537,11 +566,14 @@ lex_number (cpp_reader *pfile, cpp_string *number)
|
||||
|
||||
/* N.B. ISIDNUM does not include $. */
|
||||
while (ISIDNUM (*cur) || *cur == '.' || VALID_SIGN (*cur, cur[-1]))
|
||||
{
|
||||
cur++;
|
||||
NORMALIZE_STATE_UPDATE_IDNUM (nst);
|
||||
}
|
||||
|
||||
pfile->buffer->cur = cur;
|
||||
}
|
||||
while (forms_identifier_p (pfile, false));
|
||||
while (forms_identifier_p (pfile, false, nst));
|
||||
|
||||
number->len = cur - base;
|
||||
dest = _cpp_unaligned_alloc (pfile, number->len + 1);
|
||||
@ -897,9 +929,13 @@ _cpp_lex_direct (cpp_reader *pfile)
|
||||
|
||||
case '0': case '1': case '2': case '3': case '4':
|
||||
case '5': case '6': case '7': case '8': case '9':
|
||||
{
|
||||
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
|
||||
result->type = CPP_NUMBER;
|
||||
lex_number (pfile, &result->val.str);
|
||||
lex_number (pfile, &result->val.str, &nst);
|
||||
warn_about_normalization (pfile, result, &nst);
|
||||
break;
|
||||
}
|
||||
|
||||
case 'L':
|
||||
/* 'L' may introduce wide characters or strings. */
|
||||
@ -922,7 +958,12 @@ _cpp_lex_direct (cpp_reader *pfile)
|
||||
case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
|
||||
case 'Y': case 'Z':
|
||||
result->type = CPP_NAME;
|
||||
result->val.node = lex_identifier (pfile, buffer->cur - 1, false);
|
||||
{
|
||||
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
|
||||
result->val.node = lex_identifier (pfile, buffer->cur - 1, false,
|
||||
&nst);
|
||||
warn_about_normalization (pfile, result, &nst);
|
||||
}
|
||||
|
||||
/* Convert named operators to their proper types. */
|
||||
if (result->val.node->flags & NODE_OPERATOR)
|
||||
@ -1067,8 +1108,10 @@ _cpp_lex_direct (cpp_reader *pfile)
|
||||
result->type = CPP_DOT;
|
||||
if (ISDIGIT (*buffer->cur))
|
||||
{
|
||||
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
|
||||
result->type = CPP_NUMBER;
|
||||
lex_number (pfile, &result->val.str);
|
||||
lex_number (pfile, &result->val.str, &nst);
|
||||
warn_about_normalization (pfile, result, &nst);
|
||||
}
|
||||
else if (*buffer->cur == '.' && buffer->cur[1] == '.')
|
||||
buffer->cur += 2, result->type = CPP_ELLIPSIS;
|
||||
@ -1151,11 +1194,13 @@ _cpp_lex_direct (cpp_reader *pfile)
|
||||
case '\\':
|
||||
{
|
||||
const uchar *base = --buffer->cur;
|
||||
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
|
||||
|
||||
if (forms_identifier_p (pfile, true))
|
||||
if (forms_identifier_p (pfile, true, &nst))
|
||||
{
|
||||
result->type = CPP_NAME;
|
||||
result->val.node = lex_identifier (pfile, base, true);
|
||||
result->val.node = lex_identifier (pfile, base, true, &nst);
|
||||
warn_about_normalization (pfile, result, &nst);
|
||||
break;
|
||||
}
|
||||
buffer->cur++;
|
||||
|
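The lex.c hunks above thread a `struct normalize_state` through `forms_identifier_p`, `lex_identifier` and `lex_number`, but its definition (in internal.h) is not part of this excerpt. As a rough sketch only, the kind of check such a state can support is Unicode's canonical ordering rule: within a run of combining marks, nonzero combining classes must not decrease. The names `order_state`, `order_state_update` and `classes_ordered` below are hypothetical and are not the libcpp API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the lexer threads through; the real
   struct normalize_state is defined in internal.h and may differ.  */
struct order_state
{
  unsigned char prev_class;  /* Combining class of the previous character.  */
  bool ordered;              /* False once an out-of-order mark is seen.  */
};

/* Feed the combining class of the next character.  Class 0 (a starter,
   such as any ASCII identifier character) begins a new run.  */
static void
order_state_update (struct order_state *st, unsigned char cls)
{
  if (cls != 0 && cls < st->prev_class)
    st->ordered = false;
  st->prev_class = cls;
}

/* Return true if the N combining classes in CLS are in canonical order.  */
static bool
classes_ordered (const unsigned char *cls, size_t n)
{
  struct order_state st = { 0, true };
  size_t i;

  for (i = 0; i < n; i++)
    order_state_update (&st, cls[i]);
  return st.ordered;
}
```

Keeping only the previous class in the state is what lets the check run incrementally, one character at a time, as the lexer advances.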
342
libcpp/makeucnid.c
Normal file
@ -0,0 +1,342 @@
|
||||
/* Make ucnid.h from various sources.
|
||||
Copyright (C) 2005 Free Software Foundation, Inc.
|
||||
|
||||
This program is free software; you can redistribute it and/or modify it
|
||||
under the terms of the GNU General Public License as published by the
|
||||
Free Software Foundation; either version 2, or (at your option) any
|
||||
later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */
|
||||
|
||||
/* Run this program as
|
||||
./makeucnid ucnid.tab UnicodeData.txt DerivedNormalizationProps.txt \
|
||||
> ucnid.h
|
||||
*/
|
||||
|
||||
#include <stdio.h>
|
||||
#include <string.h>
|
||||
#include <ctype.h>
|
||||
#include <stdbool.h>
|
||||
#include <stdlib.h>
|
||||
|
||||
enum {
|
||||
C99 = 1,
|
||||
CXX = 2,
|
||||
digit = 4,
|
||||
not_NFC = 8,
|
||||
not_NFKC = 16,
|
||||
maybe_not_NFC = 32
|
||||
};
|
||||
|
||||
static unsigned flags[65536];
|
||||
static unsigned short decomp[65536][2];
|
||||
static unsigned char combining_value[65536];
|
||||
|
||||
/* Die! */
|
||||
|
||||
static void
|
||||
fail (const char *s)
|
||||
{
|
||||
fprintf (stderr, "%s\n", s);
|
||||
exit (1);
|
||||
}
|
||||
|
||||
/* Read ucnid.tab and set the C99 and CXX flags in flags[]. */
|
||||
|
||||
static void
|
||||
read_ucnid (const char *fname)
|
||||
{
|
||||
FILE *f = fopen (fname, "r");
|
||||
unsigned fl = 0;
|
||||
|
||||
if (!f)
|
||||
fail ("opening ucnid.tab");
|
||||
for (;;)
|
||||
{
|
||||
char line[256];
|
||||
|
||||
if (!fgets (line, sizeof (line), f))
|
||||
break;
|
||||
if (strcmp (line, "[C99]\n") == 0)
|
||||
fl = C99;
|
||||
else if (strcmp (line, "[CXX]\n") == 0)
|
||||
fl = CXX;
|
||||
else if (isxdigit (line[0]))
|
||||
{
|
||||
char *l = line;
|
||||
while (*l)
|
||||
{
|
||||
unsigned long start, end;
|
||||
char *endptr;
|
||||
start = strtoul (l, &endptr, 16);
|
||||
if (endptr == l || (*endptr != '-' && ! isspace (*endptr)))
|
||||
fail ("parsing ucnid.tab [1]");
|
||||
l = endptr;
|
||||
if (*l != '-')
|
||||
end = start;
|
||||
else
|
||||
{
|
||||
end = strtoul (l + 1, &endptr, 16);
|
||||
if (end < start)
|
||||
fail ("parsing ucnid.tab, end before start");
|
||||
l = endptr;
|
||||
if (! isspace (*l))
|
||||
fail ("parsing ucnid.tab, junk after range");
|
||||
}
|
||||
while (isspace (*l))
|
||||
l++;
|
||||
if (end > 0xFFFF)
|
||||
fail ("parsing ucnid.tab, end too large");
|
||||
while (start <= end)
|
||||
flags[start++] |= fl;
|
||||
}
|
||||
}
|
||||
}
|
||||
if (ferror (f))
|
||||
fail ("reading ucnid.tab");
|
||||
fclose (f);
|
||||
}
|
||||
|
||||
/* Read UnicodeData.txt and set the 'digit' flag, and
|
||||
also fill in the 'decomp' table to be the decompositions of
|
||||
characters for which both the character decomposed and all the code
|
||||
points in the decomposition are either C99 or CXX. */
|
||||
|
||||
static void
|
||||
read_table (char *fname)
|
||||
{
|
||||
FILE * f = fopen (fname, "r");
|
||||
|
||||
if (!f)
|
||||
fail ("opening UnicodeData.txt");
|
||||
for (;;)
|
||||
{
|
||||
char line[256];
|
||||
unsigned long codepoint, this_decomp[4];
|
||||
char *l;
|
||||
int i;
|
||||
int decomp_useful;
|
||||
|
||||
if (!fgets (line, sizeof (line), f))
|
||||
break;
|
||||
codepoint = strtoul (line, &l, 16);
|
||||
if (l == line || *l != ';')
|
||||
fail ("parsing UnicodeData.txt, reading code point");
|
||||
if (codepoint > 0xffff || ! (flags[codepoint] & (C99 | CXX)))
|
||||
continue;
|
||||
|
||||
do {
|
||||
l++;
|
||||
} while (*l != ';');
|
||||
/* Category value; things starting with 'N' are numbers of some
|
||||
kind. */
|
||||
if (*++l == 'N')
|
||||
flags[codepoint] |= digit;
|
||||
|
||||
do {
|
||||
l++;
|
||||
} while (*l != ';');
|
||||
/* Canonical combining class; in NFC/NFKC, they must be increasing
|
||||
(or zero). */
|
||||
if (! isdigit (*++l))
|
||||
fail ("parsing UnicodeData.txt, combining class not number");
|
||||
combining_value[codepoint] = strtoul (l, &l, 10);
|
||||
if (*l++ != ';')
|
||||
fail ("parsing UnicodeData.txt, junk after combining class");
|
||||
|
||||
/* Skip over bidi value. */
|
||||
do {
|
||||
l++;
|
||||
} while (*l != ';');
|
||||
|
||||
/* Decomposition mapping. */
|
||||
decomp_useful = flags[codepoint];
|
||||
if (*++l == '<') /* Compatibility mapping. */
|
||||
continue;
|
||||
for (i = 0; i < 4; i++)
|
||||
{
|
||||
if (*l == ';')
|
||||
break;
|
||||
if (!isxdigit (*l))
|
||||
fail ("parsing UnicodeData.txt, decomposition format");
|
||||
this_decomp[i] = strtoul (l, &l, 16);
|
||||
decomp_useful &= flags[this_decomp[i]];
|
||||
while (isspace (*l))
|
||||
l++;
|
||||
}
|
||||
if (i > 2) /* Decomposition too long. */
|
||||
fail ("parsing UnicodeData.txt, decomposition too long");
|
||||
if (decomp_useful)
|
||||
while (--i >= 0)
|
||||
decomp[codepoint][i] = this_decomp[i];
|
||||
}
|
||||
if (ferror (f))
|
||||
fail ("reading UnicodeData.txt");
|
||||
fclose (f);
|
||||
}
|
||||
|
||||
/* Read DerivedNormalizationProps.txt and set the flags that say whether
|
||||
a character is in NFC, NFKC, or is context-dependent. */
|
||||
|
||||
static void
|
||||
read_derived (const char *fname)
|
||||
{
|
||||
FILE * f = fopen (fname, "r");
|
||||
|
||||
if (!f)
|
||||
fail ("opening DerivedNormalizationProps.txt");
|
||||
for (;;)
|
||||
{
|
||||
char line[256];
|
||||
unsigned long start, end;
|
||||
char *l;
|
||||
bool not_NFC_p, not_NFKC_p, maybe_not_NFC_p;
|
||||
|
||||
if (!fgets (line, sizeof (line), f))
|
||||
break;
|
||||
not_NFC_p = (strstr (line, "; NFC_QC; N") != NULL);
|
||||
not_NFKC_p = (strstr (line, "; NFKC_QC; N") != NULL);
|
||||
maybe_not_NFC_p = (strstr (line, "; NFC_QC; M") != NULL);
|
||||
if (! not_NFC_p && ! not_NFKC_p && ! maybe_not_NFC_p)
|
||||
continue;
|
||||
|
||||
start = strtoul (line, &l, 16);
|
||||
if (l == line)
|
||||
fail ("parsing DerivedNormalizationProps.txt, reading start");
|
||||
if (start > 0xffff)
|
||||
continue;
|
||||
if (*l == '.' && l[1] == '.')
|
||||
end = strtoul (l + 2, &l, 16);
|
||||
else
|
||||
end = start;
|
||||
|
||||
while (start <= end)
|
||||
flags[start++] |= ((not_NFC_p ? not_NFC : 0)
|
||||
| (not_NFKC_p ? not_NFKC : 0)
|
||||
| (maybe_not_NFC_p ? maybe_not_NFC : 0)
|
||||
);
|
||||
}
|
||||
if (ferror (f))
|
||||
fail ("reading DerivedNormalizationProps.txt");
|
||||
fclose (f);
|
||||
}
|
||||
|
||||
/* Write out the table.
|
||||
The table consists of three fields per entry: the flags, the combining
|
||||
class, and the last Unicode code point to which the entry applies. */
|
||||
|
||||
static void
|
||||
write_table (void)
|
||||
{
|
||||
unsigned i;
|
||||
unsigned last_flag = flags[0];
|
||||
bool really_safe = decomp[0][0] == 0;
|
||||
unsigned char last_combine = combining_value[0];
|
||||
|
||||
for (i = 1; i <= 65536; i++)
|
||||
if (i == 65536
|
||||
|| (flags[i] != last_flag && ((flags[i] | last_flag) & (C99 | CXX)))
|
||||
|| really_safe != (decomp[i][0] == 0)
|
||||
|| combining_value[i] != last_combine)
|
||||
{
|
||||
printf ("{ %s|%s|%s|%s|%s|%s|%s, %3d, %#06x },\n",
|
||||
last_flag & C99 ? "C99" : " 0",
|
||||
last_flag & digit ? "DIG" : " 0",
|
||||
last_flag & CXX ? "CXX" : " 0",
|
||||
really_safe ? "CID" : " 0",
|
||||
last_flag & not_NFC ? " 0" : "NFC",
|
||||
last_flag & not_NFKC ? " 0" : "NKC",
|
||||
last_flag & maybe_not_NFC ? "CTX" : " 0",
|
||||
combining_value[i - 1],
|
||||
i - 1);
|
||||
last_flag = flags[i];
|
||||
last_combine = combining_value[0];
|
||||
really_safe = decomp[i][0] == 0;
|
||||
}
|
||||
}
|
||||
|
||||
/* Print out the huge copyright notice. */
|
||||
|
||||
static void
|
||||
write_copyright (void)
|
||||
{
|
||||
static const char copyright[] = "\
|
||||
/* Unicode characters and various properties.\n\
|
||||
Copyright (C) 2003, 2005 Free Software Foundation, Inc.\n\
|
||||
\n\
|
||||
This program is free software; you can redistribute it and/or modify it\n\
|
||||
under the terms of the GNU General Public License as published by the\n\
|
||||
Free Software Foundation; either version 2, or (at your option) any\n\
|
||||
later version.\n\
|
||||
\n\
|
||||
This program is distributed in the hope that it will be useful,\n\
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of\n\
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n\
|
||||
GNU General Public License for more details.\n\
|
||||
\n\
|
||||
You should have received a copy of the GNU General Public License\n\
|
||||
along with this program; if not, write to the Free Software\n\
|
||||
Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.\n\
|
||||
\n\
|
||||
\n\
|
||||
Copyright (C) 1991-2005 Unicode, Inc. All rights reserved.\n\
|
||||
Distributed under the Terms of Use in\n\
|
||||
http://www.unicode.org/copyright.html.\n\
|
||||
\n\
|
||||
Permission is hereby granted, free of charge, to any person\n\
|
||||
obtaining a copy of the Unicode data files and any associated\n\
|
||||
documentation (the \"Data Files\") or Unicode software and any\n\
|
||||
associated documentation (the \"Software\") to deal in the Data Files\n\
|
||||
or Software without restriction, including without limitation the\n\
|
||||
rights to use, copy, modify, merge, publish, distribute, and/or\n\
|
||||
sell copies of the Data Files or Software, and to permit persons to\n\
|
||||
whom the Data Files or Software are furnished to do so, provided\n\
|
||||
that (a) the above copyright notice(s) and this permission notice\n\
|
||||
appear with all copies of the Data Files or Software, (b) both the\n\
|
||||
above copyright notice(s) and this permission notice appear in\n\
|
||||
associated documentation, and (c) there is clear notice in each\n\
|
||||
modified Data File or in the Software as well as in the\n\
|
||||
documentation associated with the Data File(s) or Software that the\n\
|
||||
data or software has been modified.\n\
|
||||
\n\
|
||||
THE DATA FILES AND SOFTWARE ARE PROVIDED \"AS IS\", WITHOUT WARRANTY\n\
|
||||
OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE\n\
|
||||
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND\n\
|
||||
NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE\n\
|
||||
COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR\n\
|
||||
ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY\n\
|
||||
DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,\n\
|
||||
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS\n\
|
||||
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE\n\
|
||||
OF THE DATA FILES OR SOFTWARE.\n\
|
||||
\n\
|
||||
Except as contained in this notice, the name of a copyright holder\n\
|
||||
shall not be used in advertising or otherwise to promote the sale,\n\
|
||||
use or other dealings in these Data Files or Software without prior\n\
|
||||
written authorization of the copyright holder. */\n";
|
||||
|
||||
puts (copyright);
|
||||
}
|
||||
|
||||
/* Main program. */
|
||||
|
||||
int
|
||||
main(int argc, char ** argv)
|
||||
{
|
||||
if (argc != 4)
|
||||
fail ("wrong number of arguments to makeucnid");
|
||||
read_ucnid (argv[1]);
|
||||
read_table (argv[2]);
|
||||
read_derived (argv[3]);
|
||||
|
||||
write_copyright ();
|
||||
write_table ();
|
||||
return 0;
|
||||
}
|
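The generated ucnid.h is suppressed in this diff, so here is a minimal sketch of how a table in the shape `write_table` prints (flags, combining class, last code point of the range) could be searched. The struct name, the `lookup_sketch` function and the sample rows are illustrative assumptions, not the contents of the real header or the code in charset.c.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the real definitions are in the generated header.  */
enum { C99 = 1, CXX = 2, DIG = 4 };

/* One row in the shape write_table prints: the flags and combining class
   apply to every code point after the previous row's END, up to and
   including this row's END.  */
struct ucnrange_sketch
{
  unsigned char flags;
  unsigned char combine;
  unsigned short end;
};

/* Made-up sample rows, sorted by END and covering 0..0xffff.  */
static const struct ucnrange_sketch sample[] = {
  { 0,         0, 0x00a9 },
  { C99,       0, 0x00aa },
  { 0,         0, 0x065f },
  { C99 | DIG, 0, 0x0669 },
  { 0,         0, 0xffff },
};

/* Binary search for the first row whose END is >= C.  */
static const struct ucnrange_sketch *
lookup_sketch (unsigned c)
{
  size_t lo = 0;
  size_t hi = sizeof sample / sizeof sample[0] - 1;

  assert (c <= 0xffff);
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c <= sample[mid].end)
        hi = mid;
      else
        lo = mid + 1;
    }
  return &sample[lo];
}
```

Storing only the end of each range keeps the rows compact: because the last row ends at 0xffff, every code point in the table's domain falls in exactly one row, and no separate "not found" case is needed.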
1099
libcpp/ucnid.h
File diff suppressed because it is too large
130
libcpp/ucnid.pl
@ -1,130 +0,0 @@
|
||||
#! /usr/bin/perl -w
|
||||
use strict;
|
||||
|
||||
# Convert cppucnid.tab to cppucnid.h. We use two arrays of length
|
||||
# 65536 to represent the table, since this is nice and simple. The
|
||||
# first array holds the tags indicating which ranges are valid in
|
||||
# which contexts. The second array holds the language name associated
|
||||
# with each element.
|
||||
|
||||
our(@tags, @names);
|
||||
@tags = ("") x 65536;
|
||||
@names = ("") x 65536;
|
||||
|
||||
|
||||
# Array mapping tag numbers to standard #defines
|
||||
our @stds;
|
||||
|
||||
# Current standard and language
|
||||
our($curstd, $curlang);
|
||||
|
||||
# First block of the file is a template to be saved for later.
|
||||
our @template;
|
||||
|
||||
while (<>) {
|
||||
chomp;
|
||||
last if $_ eq '%%';
|
||||
push @template, $_;
|
||||
};
|
||||
|
||||
# Second block of the file is the UCN tables.
|
||||
# The format looks like this:
|
||||
#
|
||||
# [std]
|
||||
#
|
||||
# ; language
|
||||
# xxxx-xxxx xxxx xxxx-xxxx ....
|
||||
#
|
||||
# with comment lines starting with #.
|
||||
|
||||
while (<>) {
|
||||
chomp;
|
||||
/^#/ and next;
|
||||
/^\s*$/ and next;
|
||||
/^\[(.+)\]$/ and do {
|
||||
$curstd = $1;
|
||||
next;
|
||||
};
|
||||
/^; (.+)$/ and do {
|
||||
$curlang = $1;
|
||||
next;
|
||||
};
|
||||
|
||||
process_range(split);
|
||||
}
|
||||
|
||||
# Print out the template, inserting as requested.
|
||||
$\ = "\n";
|
||||
for (@template) {
|
||||
print("/* Automatically generated from cppucnid.tab, do not edit */"),
|
||||
next if $_ eq "[dne]";
|
||||
print_table(), next if $_ eq "[table]";
|
||||
print;
|
||||
}
|
||||
|
||||
sub print_table {
|
||||
my($lo, $hi);
|
||||
my $prevname = "";
|
||||
|
||||
for ($lo = 0; $lo <= $#tags; $lo = $hi) {
|
||||
$hi = $lo;
|
||||
$hi++ while $hi <= $#tags
|
||||
&& $tags[$hi] eq $tags[$lo]
|
||||
&& $names[$hi] eq $names[$lo];
|
||||
|
||||
# Range from $lo to $hi-1.
|
||||
# Don't make entries for ranges that are not valid idchars.
|
||||
next if ($tags[$lo] eq "");
|
||||
my $tag = $tags[$lo];
|
||||
$tag = " ".$tag if $tag =~ /^C99/;
|
||||
|
||||
if ($names[$lo] eq $prevname) {
|
||||
printf(" { 0x%04x, 0x%04x, %-11s },\n",
|
||||
$lo, $hi-1, $tag);
|
||||
} else {
|
||||
printf(" { 0x%04x, 0x%04x, %-11s }, /* %s */\n",
|
||||
$lo, $hi-1, $tag, $names[$lo]);
|
||||
}
|
||||
$prevname = $names[$lo];
|
||||
}
|
||||
}
|
||||
|
||||
# The line is a list of four-digit hexadecimal numbers or
|
||||
# pairs of such numbers. Each is a valid identifier character
|
||||
# from the given language, under the given standard.
|
||||
sub process_range {
|
||||
for my $range (@_) {
|
||||
if ($range =~ /^[0-9a-f]{4}$/) {
|
||||
my $i = hex($range);
|
||||
if ($tags[$i] eq "") {
|
||||
$tags[$i] = $curstd;
|
||||
} else {
|
||||
$tags[$i] = $curstd . "|" . $tags[$i];
|
||||
}
|
||||
if ($names[$i] ne "" && $names[$i] ne $curlang) {
|
||||
warn sprintf ("language overlap: %s/%s at %x (tag %d)",
|
||||
$names[$i], $curlang, $i, $tags[$i]);
|
||||
next;
|
||||
}
|
||||
$names[$i] = $curlang;
|
||||
} elsif ($range =~ /^ ([0-9a-f]{4}) - ([0-9a-f]{4}) $/x) {
|
||||
my ($start, $end) = (hex($1), hex($2));
|
||||
my $i;
|
||||
for ($i = $start; $i <= $end; $i++) {
|
||||
if ($tags[$i] eq "") {
|
||||
$tags[$i] = $curstd;
|
||||
} else {
|
||||
$tags[$i] = $curstd . "|" . $tags[$i];
|
||||
}
|
||||
if ($names[$i] ne "" && $names[$i] ne $curlang) {
|
||||
warn sprintf ("language overlap: %s/%s at %x (tag %d)",
|
||||
$names[$i], $curlang, $i, $tags[$i]);
|
||||
next;
|
||||
}
|
||||
$names[$i] = $curlang;
|
||||
}
|
||||
} else {
|
||||
warn "malformed range expression $range";
|
||||
}
|
||||
}
|
||||
}
|
@ -1,47 +1,25 @@
|
||||
/* Table of UCNs which are valid in identifiers.
|
||||
Copyright (C) 2003 Free Software Foundation, Inc.
|
||||
|
||||
This program is free software; you can redistribute it and/or modify it
|
||||
under the terms of the GNU General Public License as published by the
|
||||
Free Software Foundation; either version 2, or (at your option) any
|
||||
later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */
|
||||
|
||||
[dne]
|
||||
|
||||
/* This file reproduces the table in ISO/IEC 9899:1999 (C99) Annex
|
||||
D, which is itself a reproduction from ISO/IEC TR 10176:1998, and
|
||||
the similar table from ISO/IEC 14882:1998 (C++98) Annex E, which is
|
||||
a reproduction of ISO/IEC PDTR 10176. Unfortunately these tables
|
||||
are not identical. */
|
||||
|
||||
#ifndef LIBCPP_UCNID_H
|
||||
#define LIBCPP_UCNID_H
|
||||
|
||||
#define C99 1
|
||||
#define CXX 2
|
||||
#define DIG 4
|
||||
|
||||
struct ucnrange
|
||||
{
|
||||
unsigned short lo, hi;
|
||||
unsigned short flags;
|
||||
};
|
||||
|
||||
static const struct ucnrange ucnranges[] = {
|
||||
[table]
|
||||
};
|
||||
|
||||
#endif /* LIBCPP_UCNID_H */
|
||||
%%
|
||||
; Table of UCNs which are valid in identifiers.
|
||||
; Copyright (C) 2003, 2005 Free Software Foundation, Inc.
|
||||
;
|
||||
; This program is free software; you can redistribute it and/or modify it
|
||||
; under the terms of the GNU General Public License as published by the
|
||||
; Free Software Foundation; either version 2, or (at your option) any
|
||||
; later version.
|
||||
;
|
||||
; This program is distributed in the hope that it will be useful,
|
||||
; but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
; GNU General Public License for more details.
|
||||
;
|
||||
; You should have received a copy of the GNU General Public License
|
||||
; along with this program; if not, write to the Free Software
|
||||
; Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
|
||||
;
|
||||
; This file reproduces the table in ISO/IEC 9899:1999 (C99) Annex
|
||||
; D, which is itself a reproduction from ISO/IEC TR 10176:1998, and
|
||||
; the similar table from ISO/IEC 14882:1998 (C++98) Annex E, which is
|
||||
; a reproduction of ISO/IEC PDTR 10176. Unfortunately these tables
|
||||
; are not identical.
|
||||
|
||||
[C99]
|
||||
|
||||
@ -141,7 +119,6 @@ ac00-d7a3
|
||||
0b3d 1fbe 203f-2040 2102 2107 210a-2113 2115 2118-211d 2124 2126 2128
|
||||
212a-2131 2133-2138 2160-2182 3005-3007 3021-3029
|
||||
|
||||
[C99|DIG]
|
||||
; Digits
|
||||
0660-0669 06f0-06f9 0966-096f 09e6-09ef 0a66-0a6f 0ae6-0aef 0b66-0b6f
|
||||
0be7-0bef 0c66-0c6f 0ce6-0cef 0d66-0d6f 0e50-0e59 0ed0-0ed9 0f20-0f33
|
||||
@ -201,16 +178,12 @@ ac00-d7a3
|
||||
; Malayalam
|
||||
0d05-0d0c 0d0e-0d10 0d12-0d28 0d2a-0d39 0d60-0d61
|
||||
|
||||
# CORRECTION: Exclude 0e50-0e59 from the Thai range and make a fake
|
||||
# Digits range for it, to match C99. cppcharset.c knows that C++
|
||||
# doesn't distinguish digits from other UCNs valid in identifiers.
|
||||
; Thai
|
||||
0e01-0e30 0e32-0e33 0e40-0e46 0e4f-0e49 0e5a-0e5b
|
||||
0e01-0e30 0e32-0e33 0e40-0e46 0e4f-0e5b
|
||||
|
||||
; Digits
|
||||
0e50-0e59
|
||||
|
||||
# CORRECTION: Change 0e0d to 0e8d (typo in standard; see C++ DR 131)
|
||||
; Lao
|
||||
0e81-0e82 0e84 0e87-0e88 0e8a 0e8d 0e94-0e97 0e99-0e9f 0ea1-0ea3 0ea5
|
||||
0ea7 0eaa-0eab 0ead-0eb0 0eb2 0eb3 0ebd 0ec0-0ec4 0ec6
|
||||
@ -224,7 +197,6 @@ ac00-d7a3
|
||||
; Katakana
|
||||
30a1-30fe
|
||||
|
||||
# CORRECTION: language spelled "Bopmofo" in C++98.
|
||||
; Bopomofo
|
||||
3105-312c
|
||||
|
||||
|