gcc/unicode at devel/mold-lto-plugin - gcc

David Malcolm b050653c4c
contrib: add unicode/utf8-dump.py

This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional
control characters).

It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to
grok the relationship between visual and logical ordering in bi-di
files.

For example:

$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";

$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     |   U+0069  0x69            LATIN SMALL LETTER I  i
     |   U+006E  0x6e            LATIN SMALL LETTER N  n
     |   U+0074  0x74            LATIN SMALL LETTER T  t
     |   U+0020  0x20            SPACE  (separator)
     |   U+0F43  0xe0 0xbd 0x83  TIBETAN LETTER GHA  གྷ
     |   U+003B  0x3b            SEMICOLON  ;
     |   U+000A  0x0a            LINE FEED (LF)  (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     |   U+0063  0x63            LATIN SMALL LETTER C  c
     |   U+006F  0x6f            LATIN SMALL LETTER O  o
     |   U+006E  0x6e            LATIN SMALL LETTER N  n
     |   U+0073  0x73            LATIN SMALL LETTER S  s
     |   U+0074  0x74            LATIN SMALL LETTER T  t
     |   U+0020  0x20            SPACE  (separator)
     |   U+0063  0x63            LATIN SMALL LETTER C  c
     |   U+0068  0x68            LATIN SMALL LETTER H  h
     |   U+0061  0x61            LATIN SMALL LETTER A  a
     |   U+0072  0x72            LATIN SMALL LETTER R  r
     |   U+0020  0x20            SPACE  (separator)
     |   U+002A  0x2a            ASTERISK  *
     |   U+05D0  0xd7 0x90       HEBREW LETTER ALEF  א
     |   U+05D1  0xd7 0x91       HEBREW LETTER BET  ב
     |   U+05D2  0xd7 0x92       HEBREW LETTER GIMEL  ג
     |   U+0020  0x20            SPACE  (separator)
     |   U+003D  0x3d            EQUALS SIGN  =
     |   U+0020  0x20            SPACE  (separator)
     |   U+0022  0x22            QUOTATION MARK  "
     |   U+0041  0x41            LATIN CAPITAL LETTER A  A
     |   U+004C  0x4c            LATIN CAPITAL LETTER L  L
     |   U+0045  0x45            LATIN CAPITAL LETTER E  E
     |   U+0046  0x46            LATIN CAPITAL LETTER F  F
     |   U+002D  0x2d            HYPHEN-MINUS  -
     |   U+0042  0x42            LATIN CAPITAL LETTER B  B
     |   U+0045  0x45            LATIN CAPITAL LETTER E  E
     |   U+0054  0x54            LATIN CAPITAL LETTER T  T
     |   U+002D  0x2d            HYPHEN-MINUS  -
     |   U+0047  0x47            LATIN CAPITAL LETTER G  G
     |   U+0049  0x49            LATIN CAPITAL LETTER I  I
     |   U+004D  0x4d            LATIN CAPITAL LETTER M  M
     |   U+0045  0x45            LATIN CAPITAL LETTER E  E
     |   U+004C  0x4c            LATIN CAPITAL LETTER L  L
     |   U+0022  0x22            QUOTATION MARK  "
     |   U+003B  0x3b            SEMICOLON  ;
     |   U+000A  0x0a            LINE FEED (LF)  (control character)

Tested with Python 3.8

contrib/ChangeLog:
	* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>

2021-11-01 11:52:28 -04:00
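
The per-character dump described in the commit message can be approximated with Python's standard unicodedata module. The following is only a minimal sketch of the idea, not the contents of utf8-dump.py: the function name dump_line and the column layout are hypothetical. Control characters have no name in unicodedata, so this sketch prints a placeholder where the real script prints e.g. "LINE FEED (LF)".

```python
import unicodedata


def dump_line(line_num, line):
    """Return dump lines for one line of already-decoded text.

    Hypothetical helper: the first entry echoes the source line, then
    one entry follows per character, in logical (storage) order.
    """
    out = ["%4i | %s" % (line_num, line.rstrip("\n"))]
    for ch in line:
        # The UTF-8 encoding of this single character, one "0x.." per byte.
        utf8_bytes = " ".join("0x%02x" % b for b in ch.encode("utf-8"))
        # unicodedata has no names for control characters; use a placeholder.
        name = unicodedata.name(ch, "(unnamed)")
        shown = ch if ch.isprintable() else "(not printable)"
        out.append("     |   U+%04X  %-14s  %s  %s"
                   % (ord(ch), utf8_bytes, name, shown))
    return out


for text in dump_line(1, "int x;\n"):
    print(text)
```

Because the loop walks the string in storage order, right-to-left text is listed in logical order regardless of how a terminal would render the line.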
..
from_glibc	…
EastAsianWidth.txt	…
PropList.txt	…
README	…
UnicodeData.txt	…
gen_wcwidth.py	…
unicode-license.txt	…
utf8-dump.py	contrib: add unicode/utf8-dump.py	2021-11-01 11:52:28 -04:00

README

This directory contains a mechanism for GCC to have its own internal
implementation of wcwidth functionality (cpp_wcwidth () in libcpp/charset.c).

The idea is to produce the necessary lookup table
(../../libcpp/generated_cpp_wcwidth.h) in a reproducible way, starting from the
following files that are distributed by the Unicode Consortium:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt

These three files have been added to source control in this directory;
please see unicode-license.txt for the relevant copyright information.
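
As a rough illustration of what these data files look like to a consumer, the sketch below parses lines in the EastAsianWidth.txt layout (a code point or a first..last range, a semicolon, a property value, and an optional # comment). It is not the glibc logic in from_glibc/; the function name parse_east_asian_width and the sample lines are illustrative assumptions.

```python
def parse_east_asian_width(lines):
    """Parse EastAsianWidth.txt-style lines into (first, last, prop) tuples."""
    ranges = []
    for raw in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        code_field, prop = (field.strip() for field in data.split(";")[:2])
        if ".." in code_field:
            first, last = code_field.split("..")
        else:
            first = last = code_field
        ranges.append((int(first, 16), int(last, 16), prop))
    return ranges


# Illustrative sample in the file's layout (not the full data file).
sample = [
    "# EastAsianWidth-13.0.0.txt",
    "0020;Na          # Zs         SPACE",
    "1100..115F;W     # Lo    [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER",
]
# Wide (W) and fullwidth (F) ranges are the usual double-width candidates.
wide = [(lo, hi) for lo, hi, prop in parse_east_asian_width(sample)
        if prop in ("W", "F")]
print(wide)
```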

In order to keep in sync with glibc's wcwidth as much as possible, it is
desirable for the logic that processes the Unicode data to be the same as
glibc's.  To that end, we also put in this directory, in the from_glibc/
directory, the glibc python code that implements their logic.  This code was
copied verbatim from glibc, and it can be updated at any time from the glibc
source code repository.  The files copied from that repository are:

localedata/unicode-gen/unicode_utils.py
localedata/unicode-gen/utf8_gen.py

And the most recent versions added to GCC are from glibc git commit:
f6032247061fb37d59565f2e9667e242c8a98e76

Finally, the script gen_wcwidth.py found here contains the GCC-specific code to
map glibc's output to the lookup tables we require.  This script should not need
to change, unless there are structural changes to the Unicode data files or to
the glibc code.
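
To give a feel for what mapping per-character widths to a lookup table can involve, here is a minimal sketch that collapses runs of equal widths into ranges and looks a code point up by binary search. This README does not describe the actual layout of generated_cpp_wcwidth.h, so the representation below (parallel lists of range ends and widths) and the names compress and lookup are assumptions made for illustration only.

```python
import bisect


def compress(widths):
    """Collapse widths[cp] into parallel lists (range_ends, range_widths).

    Each entry i covers the code points after range_ends[i - 1] up to and
    including range_ends[i], all of which share width range_widths[i].
    """
    ends, vals = [], []
    for cp, width in enumerate(widths):
        if vals and vals[-1] == width:
            ends[-1] = cp          # extend the current run
        else:
            ends.append(cp)        # start a new run
            vals.append(width)
    return ends, vals


def lookup(ends, vals, cp, default=1):
    """Return the width for code point cp, or default past the table end."""
    i = bisect.bisect_left(ends, cp)
    return vals[i] if i < len(ends) else default


ends, vals = compress([0, 0, 1, 1, 1, 2])
print(ends, vals, lookup(ends, vals, 3))
```

Storing only the end of each run keeps the table small, since most of the code space falls into long runs of identical width.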

The procedure to update GCC's wcwidth tables is the following:

1.  Update the three Unicode data files from the above URLs.

2.  Update the two glibc files in from_glibc/ from glibc's git.  Update
    the commit hash above in this README.

3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
    (where X.Y is the version of the Unicode standard corresponding to the
    Unicode data files being used; most recently, 13.0.0).

After that, GCC's wcwidth will match the most recent glibc.
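
Before step 3 it can be worth confirming that the version passed to gen_wcwidth.py matches the data files. The sketch below assumes that EastAsianWidth.txt and PropList.txt begin with a comment line of the form "# Name-X.Y.Z.txt" (UnicodeData.txt carries no such header); the helper name data_file_version is hypothetical and is not part of gen_wcwidth.py.

```python
import re


def data_file_version(first_line):
    """Extract 'X.Y.Z' from a header such as '# PropList-13.0.0.txt'.

    Returns None when the line does not look like a versioned header.
    """
    match = re.match(r"#\s*\w+-(\d+\.\d+\.\d+)\.txt", first_line)
    return match.group(1) if match else None


print(data_file_version("# EastAsianWidth-13.0.0.txt"))
```

In practice one would read the first line of each downloaded file and compare the result with the version given on the command line.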