Rollup merge of #64896 - XAMPPRocky:remove-grammar, r=Centril

Remove legacy grammar

Revival of #50835 & #55545

On the #wg-grammar discord there was agreement that enough progress has been made to be able to remove the legacy grammar.

r? @Centril @qmx

cc @rust-lang/wg-grammar
This commit is contained in:
Mazdak Farrokhzad 2019-10-01 09:55:33 +02:00 committed by GitHub
commit b8c5d3a427
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
8 changed files with 4 additions and 3566 deletions

View File

@ -1,812 +1,7 @@
% Grammar
# Introduction
The Rust grammar may now be found in the [reference]. Additionally, the [grammar
working group] is working on producing a testable grammar.
This document is the primary reference for the Rust programming language grammar. It
provides only one kind of material:
- Chapters that formally define the language grammar.
This document does not serve as an introduction to the language. Background
familiarity with the language is assumed. A separate [guide] is available to
help acquire such background.
This document also does not serve as a reference to the [standard] library
included in the language distribution. Those libraries are documented
separately by extracting documentation attributes from their source code. Many
of the features that one might expect to be language features are library
features in Rust, so what you're looking for may be there, not here.
[guide]: guide.html
[standard]: std/index.html
# Notation
Rust's grammar is defined over Unicode codepoints, each conventionally denoted
`U+XXXX`, for 4 or more hexadecimal digits `X`. _Most_ of Rust's grammar is
confined to the ASCII range of Unicode, and is described in this document by a
dialect of Extended Backus-Naur Form (EBNF), specifically a dialect of EBNF
supported by common automated LL(k) parsing tools such as `llgen`, rather than
the dialect given in ISO 14977. The dialect can be defined self-referentially
as follows:
```antlr
grammar : rule + ;
rule : nonterminal ':' productionrule ';' ;
productionrule : production [ '|' production ] * ;
production : term * ;
term : element repeats ;
element : LITERAL | IDENTIFIER | '[' productionrule ']' ;
repeats : [ '*' | '+' ] NUMBER ? | NUMBER ? | '?' ;
```
Where:
- Whitespace in the grammar is ignored.
- Square brackets are used to group rules.
- `LITERAL` is a single printable ASCII character, or an escaped hexadecimal
ASCII code of the form `\xQQ`, in single quotes, denoting the corresponding
Unicode codepoint `U+00QQ`.
- `IDENTIFIER` is a nonempty string of ASCII letters and underscores.
- The `repeat` forms apply to the adjacent `element`, and are as follows:
- `?` means zero or one repetition
- `*` means zero or more repetitions
- `+` means one or more repetitions
- NUMBER trailing a repeat symbol gives a maximum repetition count
- NUMBER on its own gives an exact repetition count
This EBNF dialect should hopefully be familiar to many readers.
## Unicode productions
A few productions in Rust's grammar permit Unicode codepoints outside the ASCII
range. We define these productions in terms of character properties specified
in the Unicode standard, rather than in terms of ASCII-range codepoints. The
section [Special Unicode Productions](#special-unicode-productions) lists these
productions.
## String table productions
Some rules in the grammar — notably [unary
operators](#unary-operator-expressions), [binary
operators](#binary-operator-expressions), and [keywords](#keywords) — are
given in a simplified form: as a listing of a table of unquoted, printable
whitespace-separated strings. These cases form a subset of the rules regarding
the [token](#tokens) rule, and are assumed to be the result of a
lexical-analysis phase feeding the parser, driven by a DFA, operating over the
disjunction of all such string table entries.
When such a string enclosed in double-quotes (`"`) occurs inside the grammar,
it is an implicit reference to a single member of such a string table
production. See [tokens](#tokens) for more information.
# Lexical structure
## Input format
Rust input is interpreted as a sequence of Unicode codepoints encoded in UTF-8.
Most Rust grammar rules are defined in terms of printable ASCII-range
codepoints, but a small number are defined in terms of Unicode properties or
explicit codepoint lists. [^inputformat]
[^inputformat]: Substitute definitions for the special Unicode productions are
provided to the grammar verifier, restricted to ASCII range, when verifying the
grammar in this document.
## Special Unicode Productions
The following productions in the Rust grammar are defined in terms of Unicode
properties: `ident`, `non_null`, `non_eol`, `non_single_quote` and
`non_double_quote`.
### Identifiers
The `ident` production is any nonempty Unicode string of
the following form:
- The first character is in one of the following ranges `U+0041` to `U+005A`
("A" to "Z"), `U+0061` to `U+007A` ("a" to "z"), or `U+005F` ("\_").
- The remaining characters are in the range `U+0030` to `U+0039` ("0" to "9"),
or any of the prior valid initial characters.
as long as the identifier does _not_ occur in the set of [keywords](#keywords).
### Delimiter-restricted productions
Some productions are defined by exclusion of particular Unicode characters:
- `non_null` is any single Unicode character aside from `U+0000` (null)
- `non_eol` is any single Unicode character aside from `U+000A` (`'\n'`)
- `non_single_quote` is any single Unicode character aside from `U+0027` (`'`)
- `non_double_quote` is any single Unicode character aside from `U+0022` (`"`)
## Comments
```antlr
comment : block_comment | line_comment ;
block_comment : "/*" block_comment_body * "*/" ;
block_comment_body : [block_comment | character] * ;
line_comment : "//" non_eol * ;
```
**FIXME:** add doc grammar?
## Whitespace
```antlr
whitespace_char : '\x20' | '\x09' | '\x0a' | '\x0d' ;
whitespace : [ whitespace_char | comment ] + ;
```
## Tokens
```antlr
simple_token : keyword | unop | binop ;
token : simple_token | ident | literal | symbol | whitespace token ;
```
### Keywords
<p id="keyword-table-marker"></p>
| | | | | |
|----------|----------|----------|----------|----------|
| _ | abstract | alignof | as | become |
| box | break | const | continue | crate |
| do | else | enum | extern | false |
| final | fn | for | if | impl |
| in | let | loop | macro | match |
| mod | move | mut | offsetof | override |
| priv | proc | pub | pure | ref |
| return | Self | self | sizeof | static |
| struct | super | trait | true | type |
| typeof | unsafe | unsized | use | virtual |
| where | while | yield | | |
Each of these keywords has special meaning in its grammar, and all of them are
excluded from the `ident` rule.
Not all of these keywords are used by the language. Some of them were used
before Rust 1.0, and were left reserved once their implementations were
removed. Some of them were reserved before 1.0 to make space for possible
future features.
### Literals
```antlr
lit_suffix : ident;
literal : [ string_lit | char_lit | byte_string_lit | byte_lit | num_lit | bool_lit ] lit_suffix ?;
```
The optional `lit_suffix` production is only used for certain numeric literals,
but is reserved for future extension. That is, the above gives the lexical
grammar, but a Rust parser will reject everything but the 12 special cases
mentioned in [Number literals](reference/tokens.html#number-literals) in the
reference.
#### Character and string literals
```antlr
char_lit : '\x27' char_body '\x27' ;
string_lit : '"' string_body * '"' | 'r' raw_string ;
char_body : non_single_quote
| '\x5c' [ '\x27' | common_escape | unicode_escape ] ;
string_body : non_double_quote
| '\x5c' [ '\x22' | common_escape | unicode_escape ] ;
raw_string : '"' raw_string_body '"' | '#' raw_string '#' ;
common_escape : '\x5c'
| 'n' | 'r' | 't' | '0'
| 'x' hex_digit 2
unicode_escape : 'u' '{' hex_digit+ 6 '}';
hex_digit : 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
| 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
| dec_digit ;
oct_digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' ;
dec_digit : '0' | nonzero_dec ;
nonzero_dec: '1' | '2' | '3' | '4'
| '5' | '6' | '7' | '8' | '9' ;
```
#### Byte and byte string literals
```antlr
byte_lit : "b\x27" byte_body '\x27' ;
byte_string_lit : "b\x22" string_body * '\x22' | "br" raw_byte_string ;
byte_body : ascii_non_single_quote
| '\x5c' [ '\x27' | common_escape ] ;
byte_string_body : ascii_non_double_quote
| '\x5c' [ '\x22' | common_escape ] ;
raw_byte_string : '"' raw_byte_string_body '"' | '#' raw_byte_string '#' ;
```
#### Number literals
```antlr
num_lit : nonzero_dec [ dec_digit | '_' ] * float_suffix ?
| '0' [ [ dec_digit | '_' ] * float_suffix ?
| 'b' [ '1' | '0' | '_' ] +
| 'o' [ oct_digit | '_' ] +
| 'x' [ hex_digit | '_' ] + ] ;
float_suffix : [ exponent | '.' dec_lit exponent ? ] ? ;
exponent : ['E' | 'e'] ['-' | '+' ] ? dec_lit ;
dec_lit : [ dec_digit | '_' ] + ;
```
#### Boolean literals
```antlr
bool_lit : [ "true" | "false" ] ;
```
The two values of the boolean type are written `true` and `false`.
### Symbols
```antlr
symbol : "::" | "->"
| '#' | '[' | ']' | '(' | ')' | '{' | '}'
| ',' | ';' ;
```
Symbols are a general class of printable [tokens](#tokens) that play structural
roles in a variety of grammar productions. They are cataloged here for
completeness as the set of remaining miscellaneous printable tokens that do not
otherwise appear as [unary operators](#unary-operator-expressions), [binary
operators](#binary-operator-expressions), or [keywords](#keywords).
## Paths
```antlr
expr_path : [ "::" ] ident [ "::" expr_path_tail ] + ;
expr_path_tail : '<' type_expr [ ',' type_expr ] + '>'
| expr_path ;
type_path : ident [ type_path_tail ] + ;
type_path_tail : '<' type_expr [ ',' type_expr ] + '>'
| "::" type_path ;
```
# Syntax extensions
## Macros
```antlr
expr_macro_rules : "macro_rules" '!' ident '(' macro_rule * ')' ';'
| "macro_rules" '!' ident '{' macro_rule * '}' ;
macro_rule : '(' matcher * ')' "=>" '(' transcriber * ')' ';' ;
matcher : '(' matcher * ')' | '[' matcher * ']'
| '{' matcher * '}' | '$' ident ':' ident
| '$' '(' matcher * ')' sep_token? [ '*' | '+' ]
| non_special_token ;
transcriber : '(' transcriber * ')' | '[' transcriber * ']'
| '{' transcriber * '}' | '$' ident
| '$' '(' transcriber * ')' sep_token? [ '*' | '+' ]
| non_special_token ;
```
# Crates and source files
**FIXME:** grammar? What production covers #![crate_id = "foo"] ?
# Items and attributes
**FIXME:** grammar?
## Items
```antlr
item : vis ? mod_item | fn_item | type_item | struct_item | enum_item
| const_item | static_item | trait_item | impl_item | extern_block_item ;
```
### Type Parameters
**FIXME:** grammar?
### Modules
```antlr
mod_item : "mod" ident ( ';' | '{' mod '}' );
mod : [ view_item | item ] * ;
```
#### View items
```antlr
view_item : extern_crate_decl | use_decl ';' ;
```
##### Extern crate declarations
```antlr
extern_crate_decl : "extern" "crate" crate_name
crate_name: ident | ( ident "as" ident )
```
##### Use declarations
```antlr
use_decl : vis ? "use" [ path "as" ident
| path_glob ] ;
path_glob : ident [ "::" [ path_glob
| '*' ] ] ?
| '{' path_item [ ',' path_item ] * '}' ;
path_item : ident | "self" ;
```
### Functions
**FIXME:** grammar?
#### Generic functions
**FIXME:** grammar?
#### Unsafety
**FIXME:** grammar?
##### Unsafe functions
**FIXME:** grammar?
##### Unsafe blocks
**FIXME:** grammar?
#### Diverging functions
**FIXME:** grammar?
### Type definitions
**FIXME:** grammar?
### Structures
**FIXME:** grammar?
### Enumerations
**FIXME:** grammar?
### Constant items
```antlr
const_item : "const" ident ':' type '=' expr ';' ;
```
### Static items
```antlr
static_item : "static" ident ':' type '=' expr ';' ;
```
#### Mutable statics
**FIXME:** grammar?
### Traits
**FIXME:** grammar?
### Implementations
**FIXME:** grammar?
### External blocks
```antlr
extern_block_item : "extern" '{' extern_block '}' ;
extern_block : [ foreign_fn ] * ;
```
## Visibility and Privacy
```antlr
vis : "pub" ;
```
### Re-exporting and Visibility
See [Use declarations](#use-declarations).
## Attributes
```antlr
attribute : '#' '!' ? '[' meta_item ']' ;
meta_item : ident [ '=' literal
| '(' meta_seq ')' ] ? ;
meta_seq : meta_item [ ',' meta_seq ] ? ;
```
# Statements and expressions
## Statements
```antlr
stmt : decl_stmt | expr_stmt | ';' ;
```
### Declaration statements
```antlr
decl_stmt : item | let_decl ;
```
#### Item declarations
See [Items](#items).
#### Variable declarations
```antlr
let_decl : "let" pat [':' type ] ? [ init ] ? ';' ;
init : [ '=' ] expr ;
```
### Expression statements
```antlr
expr_stmt : expr ';' ;
```
## Expressions
```antlr
expr : literal | path | tuple_expr | unit_expr | struct_expr
| block_expr | method_call_expr | field_expr | array_expr
| idx_expr | range_expr | unop_expr | binop_expr
| paren_expr | call_expr | lambda_expr | while_expr
| loop_expr | break_expr | continue_expr | for_expr
| if_expr | match_expr | if_let_expr | while_let_expr
| return_expr ;
```
#### Lvalues, rvalues and temporaries
**FIXME:** grammar?
#### Moved and copied types
**FIXME:** Do we want to capture this in the grammar as different productions?
### Literal expressions
See [Literals](#literals).
### Path expressions
See [Paths](#paths).
### Tuple expressions
```antlr
tuple_expr : '(' [ expr [ ',' expr ] * | expr ',' ] ? ')' ;
```
### Unit expressions
```antlr
unit_expr : "()" ;
```
### Structure expressions
```antlr
struct_expr_field_init : ident | ident ':' expr ;
struct_expr : expr_path '{' struct_expr_field_init
[ ',' struct_expr_field_init ] *
[ ".." expr ] '}' |
expr_path '(' expr
[ ',' expr ] * ')' |
expr_path ;
```
### Block expressions
```antlr
block_expr : '{' [ stmt | item ] *
[ expr ] '}' ;
```
### Method-call expressions
```antlr
method_call_expr : expr '.' ident paren_expr_list ;
```
### Field expressions
```antlr
field_expr : expr '.' ident ;
```
### Array expressions
```antlr
array_expr : '[' "mut" ? array_elems? ']' ;
array_elems : [expr [',' expr]*] | [expr ';' expr] ;
```
### Index expressions
```antlr
idx_expr : expr '[' expr ']' ;
```
### Range expressions
```antlr
range_expr : expr ".." expr |
expr ".." |
".." expr |
".." ;
```
### Unary operator expressions
```antlr
unop_expr : unop expr ;
unop : '-' | '*' | '!' ;
```
### Binary operator expressions
```antlr
binop_expr : expr binop expr | type_cast_expr
| assignment_expr | compound_assignment_expr ;
binop : arith_op | bitwise_op | lazy_bool_op | comp_op
```
#### Arithmetic operators
```antlr
arith_op : '+' | '-' | '*' | '/' | '%' ;
```
#### Bitwise operators
```antlr
bitwise_op : '&' | '|' | '^' | "<<" | ">>" ;
```
#### Lazy boolean operators
```antlr
lazy_bool_op : "&&" | "||" ;
```
#### Comparison operators
```antlr
comp_op : "==" | "!=" | '<' | '>' | "<=" | ">=" ;
```
#### Type cast expressions
```antlr
type_cast_expr : value "as" type ;
```
#### Assignment expressions
```antlr
assignment_expr : expr '=' expr ;
```
#### Compound assignment expressions
```antlr
compound_assignment_expr : expr [ arith_op | bitwise_op ] '=' expr ;
```
### Grouped expressions
```antlr
paren_expr : '(' expr ')' ;
```
### Call expressions
```antlr
expr_list : [ expr [ ',' expr ]* ] ? ;
paren_expr_list : '(' expr_list ')' ;
call_expr : expr paren_expr_list ;
```
### Lambda expressions
```antlr
ident_list : [ ident [ ',' ident ]* ] ? ;
lambda_expr : '|' ident_list '|' expr ;
```
### While loops
```antlr
while_expr : [ lifetime ':' ] ? "while" no_struct_literal_expr '{' block '}' ;
```
### Infinite loops
```antlr
loop_expr : [ lifetime ':' ] ? "loop" '{' block '}';
```
### Break expressions
```antlr
break_expr : "break" [ lifetime ] ?;
```
### Continue expressions
```antlr
continue_expr : "continue" [ lifetime ] ?;
```
### For expressions
```antlr
for_expr : [ lifetime ':' ] ? "for" pat "in" no_struct_literal_expr '{' block '}' ;
```
### If expressions
```antlr
if_expr : "if" no_struct_literal_expr '{' block '}'
else_tail ? ;
else_tail : "else" [ if_expr | if_let_expr
| '{' block '}' ] ;
```
### Match expressions
```antlr
match_expr : "match" no_struct_literal_expr '{' match_arm * '}' ;
match_arm : attribute * match_pat "=>" [ expr "," | '{' block '}' ] ;
match_pat : pat [ '|' pat ] * [ "if" expr ] ? ;
```
### If let expressions
```antlr
if_let_expr : "if" "let" pat '=' expr '{' block '}'
else_tail ? ;
```
### While let loops
```antlr
while_let_expr : [ lifetime ':' ] ? "while" "let" pat '=' expr '{' block '}' ;
```
### Return expressions
```antlr
return_expr : "return" expr ? ;
```
# Type system
**FIXME:** is this entire chapter relevant here? Or should it all have been covered by some production already?
## Types
### Primitive types
**FIXME:** grammar?
#### Machine types
**FIXME:** grammar?
#### Machine-dependent integer types
**FIXME:** grammar?
### Textual types
**FIXME:** grammar?
### Tuple types
**FIXME:** grammar?
### Array, and Slice types
**FIXME:** grammar?
### Structure types
**FIXME:** grammar?
### Enumerated types
**FIXME:** grammar?
### Pointer types
**FIXME:** grammar?
### Function types
**FIXME:** grammar?
### Closure types
```antlr
closure_type := [ 'unsafe' ] [ '<' lifetime-list '>' ] '|' arg-list '|'
[ ':' bound-list ] [ '->' type ]
lifetime-list := lifetime | lifetime ',' lifetime-list
arg-list := ident ':' type | ident ':' type ',' arg-list
```
### Never type
An empty type
```antlr
never_type : "!" ;
```
### Object types
**FIXME:** grammar?
### Type parameters
**FIXME:** grammar?
### Type parameter bounds
```antlr
bound-list := bound | bound '+' bound-list '+' ?
bound := ty_bound | lt_bound
lt_bound := lifetime
ty_bound := ty_bound_noparen | (ty_bound_noparen)
ty_bound_noparen := [?] [ for<lt_param_defs> ] simple_path
```
### Self types
**FIXME:** grammar?
## Type kinds
**FIXME:** this is probably not relevant to the grammar...
# Memory and concurrency models
**FIXME:** is this entire chapter relevant here? Or should it all have been covered by some production already?
## Memory model
### Memory allocation and lifetime
### Memory ownership
### Variables
### Boxes
## Threads
### Communication between threads
### Thread lifecycle
[reference]: https://doc.rust-lang.org/reference/
[grammar working group]: https://github.com/rust-lang/wg-grammar

View File

@ -1,3 +0,0 @@
*.class
*.java
*.tokens

View File

@ -1,350 +0,0 @@
%{
#include <stdio.h>
#include <ctype.h>
static int num_hashes;
static int end_hashes;
static int saw_non_hash;
%}
%option stack
%option yylineno
%x str
%x rawstr
%x rawstr_esc_begin
%x rawstr_esc_body
%x rawstr_esc_end
%x byte
%x bytestr
%x rawbytestr
%x rawbytestr_nohash
%x pound
%x shebang_or_attr
%x ltorchar
%x linecomment
%x doc_line
%x blockcomment
%x doc_block
%x suffix
ident [a-zA-Z\x80-\xff_][a-zA-Z0-9\x80-\xff_]*
%%
<suffix>{ident} { BEGIN(INITIAL); }
<suffix>(.|\n) { yyless(0); BEGIN(INITIAL); }
[ \n\t\r] { }
\xef\xbb\xbf {
// UTF-8 byte order mark (BOM), ignore if in line 1, error otherwise
if (yyget_lineno() != 1) {
return -1;
}
}
\/\/(\/|\!) { BEGIN(doc_line); yymore(); }
<doc_line>\n { BEGIN(INITIAL);
yyleng--;
yytext[yyleng] = 0;
return ((yytext[2] == '!') ? INNER_DOC_COMMENT : OUTER_DOC_COMMENT);
}
<doc_line>[^\n]* { yymore(); }
\/\/|\/\/\/\/ { BEGIN(linecomment); }
<linecomment>\n { BEGIN(INITIAL); }
<linecomment>[^\n]* { }
\/\*(\*|\!)[^*] { yy_push_state(INITIAL); yy_push_state(doc_block); yymore(); }
<doc_block>\/\* { yy_push_state(doc_block); yymore(); }
<doc_block>\*\/ {
yy_pop_state();
if (yy_top_state() == doc_block) {
yymore();
} else {
return ((yytext[2] == '!') ? INNER_DOC_COMMENT : OUTER_DOC_COMMENT);
}
}
<doc_block>(.|\n) { yymore(); }
\/\* { yy_push_state(blockcomment); }
<blockcomment>\/\* { yy_push_state(blockcomment); }
<blockcomment>\*\/ { yy_pop_state(); }
<blockcomment>(.|\n) { }
_ { return UNDERSCORE; }
abstract { return ABSTRACT; }
alignof { return ALIGNOF; }
as { return AS; }
become { return BECOME; }
box { return BOX; }
break { return BREAK; }
catch { return CATCH; }
const { return CONST; }
continue { return CONTINUE; }
crate { return CRATE; }
default { return DEFAULT; }
do { return DO; }
else { return ELSE; }
enum { return ENUM; }
extern { return EXTERN; }
false { return FALSE; }
final { return FINAL; }
fn { return FN; }
for { return FOR; }
if { return IF; }
impl { return IMPL; }
in { return IN; }
let { return LET; }
loop { return LOOP; }
macro { return MACRO; }
match { return MATCH; }
mod { return MOD; }
move { return MOVE; }
mut { return MUT; }
offsetof { return OFFSETOF; }
override { return OVERRIDE; }
priv { return PRIV; }
proc { return PROC; }
pure { return PURE; }
pub { return PUB; }
ref { return REF; }
return { return RETURN; }
self { return SELF; }
sizeof { return SIZEOF; }
static { return STATIC; }
struct { return STRUCT; }
super { return SUPER; }
trait { return TRAIT; }
true { return TRUE; }
type { return TYPE; }
typeof { return TYPEOF; }
union { return UNION; }
unsafe { return UNSAFE; }
unsized { return UNSIZED; }
use { return USE; }
virtual { return VIRTUAL; }
where { return WHERE; }
while { return WHILE; }
yield { return YIELD; }
{ident} { return IDENT; }
0x[0-9a-fA-F_]+ { BEGIN(suffix); return LIT_INTEGER; }
0o[0-7_]+ { BEGIN(suffix); return LIT_INTEGER; }
0b[01_]+ { BEGIN(suffix); return LIT_INTEGER; }
[0-9][0-9_]* { BEGIN(suffix); return LIT_INTEGER; }
[0-9][0-9_]*\.(\.|[a-zA-Z]) { yyless(yyleng - 2); BEGIN(suffix); return LIT_INTEGER; }
[0-9][0-9_]*\.[0-9_]*([eE][-\+]?[0-9_]+)? { BEGIN(suffix); return LIT_FLOAT; }
[0-9][0-9_]*(\.[0-9_]*)?[eE][-\+]?[0-9_]+ { BEGIN(suffix); return LIT_FLOAT; }
; { return ';'; }
, { return ','; }
\.\.\. { return DOTDOTDOT; }
\.\. { return DOTDOT; }
\. { return '.'; }
\( { return '('; }
\) { return ')'; }
\{ { return '{'; }
\} { return '}'; }
\[ { return '['; }
\] { return ']'; }
@ { return '@'; }
# { BEGIN(pound); yymore(); }
<pound>\! { BEGIN(shebang_or_attr); yymore(); }
<shebang_or_attr>\[ {
BEGIN(INITIAL);
yyless(2);
return SHEBANG;
}
<shebang_or_attr>[^\[\n]*\n {
// Since the \n was eaten as part of the token, yylineno will have
// been incremented to the value 2 if the shebang was on the first
// line. This yyless undoes that, setting yylineno back to 1.
yyless(yyleng - 1);
if (yyget_lineno() == 1) {
BEGIN(INITIAL);
return SHEBANG_LINE;
} else {
BEGIN(INITIAL);
yyless(2);
return SHEBANG;
}
}
<pound>. { BEGIN(INITIAL); yyless(1); return '#'; }
\~ { return '~'; }
:: { return MOD_SEP; }
: { return ':'; }
\$ { return '$'; }
\? { return '?'; }
== { return EQEQ; }
=> { return FAT_ARROW; }
= { return '='; }
\!= { return NE; }
\! { return '!'; }
\<= { return LE; }
\<\< { return SHL; }
\<\<= { return SHLEQ; }
\< { return '<'; }
\>= { return GE; }
\>\> { return SHR; }
\>\>= { return SHREQ; }
\> { return '>'; }
\x27 { BEGIN(ltorchar); yymore(); }
<ltorchar>static { BEGIN(INITIAL); return STATIC_LIFETIME; }
<ltorchar>{ident} { BEGIN(INITIAL); return LIFETIME; }
<ltorchar>\\[nrt\\\x27\x220]\x27 { BEGIN(suffix); return LIT_CHAR; }
<ltorchar>\\x[0-9a-fA-F]{2}\x27 { BEGIN(suffix); return LIT_CHAR; }
<ltorchar>\\u\{([0-9a-fA-F]_*){1,6}\}\x27 { BEGIN(suffix); return LIT_CHAR; }
<ltorchar>.\x27 { BEGIN(suffix); return LIT_CHAR; }
<ltorchar>[\x80-\xff]{2,4}\x27 { BEGIN(suffix); return LIT_CHAR; }
<ltorchar><<EOF>> { BEGIN(INITIAL); return -1; }
b\x22 { BEGIN(bytestr); yymore(); }
<bytestr>\x22 { BEGIN(suffix); return LIT_BYTE_STR; }
<bytestr><<EOF>> { return -1; }
<bytestr>\\[n\nrt\\\x27\x220] { yymore(); }
<bytestr>\\x[0-9a-fA-F]{2} { yymore(); }
<bytestr>\\u\{([0-9a-fA-F]_*){1,6}\} { yymore(); }
<bytestr>\\[^n\nrt\\\x27\x220] { return -1; }
<bytestr>(.|\n) { yymore(); }
br\x22 { BEGIN(rawbytestr_nohash); yymore(); }
<rawbytestr_nohash>\x22 { BEGIN(suffix); return LIT_BYTE_STR_RAW; }
<rawbytestr_nohash>(.|\n) { yymore(); }
<rawbytestr_nohash><<EOF>> { return -1; }
br/# {
BEGIN(rawbytestr);
yymore();
num_hashes = 0;
saw_non_hash = 0;
end_hashes = 0;
}
<rawbytestr># {
if (!saw_non_hash) {
num_hashes++;
} else if (end_hashes != 0) {
end_hashes++;
if (end_hashes == num_hashes) {
BEGIN(INITIAL);
return LIT_BYTE_STR_RAW;
}
}
yymore();
}
<rawbytestr>\x22# {
end_hashes = 1;
if (end_hashes == num_hashes) {
BEGIN(INITIAL);
return LIT_BYTE_STR_RAW;
}
yymore();
}
<rawbytestr>(.|\n) {
if (!saw_non_hash) {
saw_non_hash = 1;
}
if (end_hashes != 0) {
end_hashes = 0;
}
yymore();
}
<rawbytestr><<EOF>> { return -1; }
b\x27 { BEGIN(byte); yymore(); }
<byte>\\[nrt\\\x27\x220]\x27 { BEGIN(INITIAL); return LIT_BYTE; }
<byte>\\x[0-9a-fA-F]{2}\x27 { BEGIN(INITIAL); return LIT_BYTE; }
<byte>\\u([0-9a-fA-F]_*){4}\x27 { BEGIN(INITIAL); return LIT_BYTE; }
<byte>\\U([0-9a-fA-F]_*){8}\x27 { BEGIN(INITIAL); return LIT_BYTE; }
<byte>.\x27 { BEGIN(INITIAL); return LIT_BYTE; }
<byte><<EOF>> { BEGIN(INITIAL); return -1; }
r\x22 { BEGIN(rawstr); yymore(); }
<rawstr>\x22 { BEGIN(suffix); return LIT_STR_RAW; }
<rawstr>(.|\n) { yymore(); }
<rawstr><<EOF>> { return -1; }
r/# {
BEGIN(rawstr_esc_begin);
yymore();
num_hashes = 0;
saw_non_hash = 0;
end_hashes = 0;
}
<rawstr_esc_begin># {
num_hashes++;
yymore();
}
<rawstr_esc_begin>\x22 {
BEGIN(rawstr_esc_body);
yymore();
}
<rawstr_esc_begin>(.|\n) { return -1; }
<rawstr_esc_body>\x22/# {
BEGIN(rawstr_esc_end);
yymore();
}
<rawstr_esc_body>(.|\n) {
yymore();
}
<rawstr_esc_end># {
end_hashes++;
if (end_hashes == num_hashes) {
BEGIN(INITIAL);
return LIT_STR_RAW;
}
yymore();
}
<rawstr_esc_end>[^#] {
end_hashes = 0;
BEGIN(rawstr_esc_body);
yymore();
}
<rawstr_esc_begin,rawstr_esc_body,rawstr_esc_end><<EOF>> { return -1; }
\x22 { BEGIN(str); yymore(); }
<str>\x22 { BEGIN(suffix); return LIT_STR; }
<str><<EOF>> { return -1; }
<str>\\[n\nr\rt\\\x27\x220] { yymore(); }
<str>\\x[0-9a-fA-F]{2} { yymore(); }
<str>\\u\{([0-9a-fA-F]_*){1,6}\} { yymore(); }
<str>\\[^n\nrt\\\x27\x220] { return -1; }
<str>(.|\n) { yymore(); }
\<- { return LARROW; }
-\> { return RARROW; }
- { return '-'; }
-= { return MINUSEQ; }
&& { return ANDAND; }
& { return '&'; }
&= { return ANDEQ; }
\|\| { return OROR; }
\| { return '|'; }
\|= { return OREQ; }
\+ { return '+'; }
\+= { return PLUSEQ; }
\* { return '*'; }
\*= { return STAREQ; }
\/ { return '/'; }
\/= { return SLASHEQ; }
\^ { return '^'; }
\^= { return CARETEQ; }
% { return '%'; }
%= { return PERCENTEQ; }
<<EOF>> { return 0; }
%%

View File

@ -1,193 +0,0 @@
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
extern int yylex();
extern int rsparse();
#define PUSHBACK_LEN 4
static char pushback[PUSHBACK_LEN];
static int verbose;
void print(const char* format, ...) {
va_list args;
va_start(args, format);
if (verbose) {
vprintf(format, args);
}
va_end(args);
}
// If there is a non-null char at the head of the pushback queue,
// dequeue it and shift the rest of the queue forwards. Otherwise,
// return the token from calling yylex.
int rslex() {
if (pushback[0] == '\0') {
return yylex();
} else {
char c = pushback[0];
memmove(pushback, pushback + 1, PUSHBACK_LEN - 1);
pushback[PUSHBACK_LEN - 1] = '\0';
return c;
}
}
// Note: this does nothing if the pushback queue is full. As long as
// there aren't more than PUSHBACK_LEN consecutive calls to push_back
// in an action, this shouldn't be a problem.
void push_back(char c) {
for (int i = 0; i < PUSHBACK_LEN; ++i) {
if (pushback[i] == '\0') {
pushback[i] = c;
break;
}
}
}
extern int rsdebug;
struct node {
struct node *next;
struct node *prev;
int own_string;
char const *name;
int n_elems;
struct node *elems[];
};
struct node *nodes = NULL;
int n_nodes;
struct node *mk_node(char const *name, int n, ...) {
va_list ap;
int i = 0;
unsigned sz = sizeof(struct node) + (n * sizeof(struct node *));
struct node *nn, *nd = (struct node *)malloc(sz);
print("# New %d-ary node: %s = %p\n", n, name, nd);
nd->own_string = 0;
nd->prev = NULL;
nd->next = nodes;
if (nodes) {
nodes->prev = nd;
}
nodes = nd;
nd->name = name;
nd->n_elems = n;
va_start(ap, n);
while (i < n) {
nn = va_arg(ap, struct node *);
print("# arg[%d]: %p\n", i, nn);
print("# (%s ...)\n", nn->name);
nd->elems[i++] = nn;
}
va_end(ap);
n_nodes++;
return nd;
}
struct node *mk_atom(char *name) {
struct node *nd = mk_node((char const *)strdup(name), 0);
nd->own_string = 1;
return nd;
}
struct node *mk_none() {
return mk_atom("<none>");
}
struct node *ext_node(struct node *nd, int n, ...) {
va_list ap;
int i = 0, c = nd->n_elems + n;
unsigned sz = sizeof(struct node) + (c * sizeof(struct node *));
struct node *nn;
print("# Extending %d-ary node by %d nodes: %s = %p",
nd->n_elems, c, nd->name, nd);
if (nd->next) {
nd->next->prev = nd->prev;
}
if (nd->prev) {
nd->prev->next = nd->next;
}
nd = realloc(nd, sz);
nd->prev = NULL;
nd->next = nodes;
nodes->prev = nd;
nodes = nd;
print(" ==> %p\n", nd);
va_start(ap, n);
while (i < n) {
nn = va_arg(ap, struct node *);
print("# arg[%d]: %p\n", i, nn);
print("# (%s ...)\n", nn->name);
nd->elems[nd->n_elems++] = nn;
++i;
}
va_end(ap);
return nd;
}
int const indent_step = 4;
void print_indent(int depth) {
while (depth) {
if (depth-- % indent_step == 0) {
print("|");
} else {
print(" ");
}
}
}
void print_node(struct node *n, int depth) {
int i = 0;
print_indent(depth);
if (n->n_elems == 0) {
print("%s\n", n->name);
} else {
print("(%s\n", n->name);
for (i = 0; i < n->n_elems; ++i) {
print_node(n->elems[i], depth + indent_step);
}
print_indent(depth);
print(")\n");
}
}
int main(int argc, char **argv) {
if (argc == 2 && strcmp(argv[1], "-v") == 0) {
verbose = 1;
} else {
verbose = 0;
}
int ret = 0;
struct node *tmp;
memset(pushback, '\0', PUSHBACK_LEN);
ret = rsparse();
print("--- PARSE COMPLETE: ret:%d, n_nodes:%d ---\n", ret, n_nodes);
if (nodes) {
print_node(nodes, 0);
}
while (nodes) {
tmp = nodes;
nodes = tmp->next;
if (tmp->own_string) {
free((void*)tmp->name);
}
free(tmp);
}
return ret;
}
void rserror(char const *s) {
fprintf(stderr, "%s\n", s);
}

File diff suppressed because it is too large Load Diff

View File

@ -1,64 +0,0 @@
Rust's lexical grammar is not context-free. Raw string literals are the source
of the problem. Informally, a raw string literal is an `r`, followed by `N`
hashes (where N can be zero), a quote, any characters, then a quote followed
by `N` hashes. Critically, once inside the first pair of quotes,
another quote cannot be followed by `N` consecutive hashes. e.g.
`r###""###"###` is invalid.
This grammar describes this as best possible:
R -> 'r' S
S -> '"' B '"'
S -> '#' S '#'
B -> . B
B -> ε
Where `.` represents any character, and `ε` the empty string. Consider the
string `r#""#"#`. This string is not a valid raw string literal, but can be
accepted as one by the above grammar, using the derivation:
R : #""#"#
S : ""#"
S : "#
B : #
B : ε
(Where `T : U` means the rule `T` is applied, and `U` is the remainder of the
string.) The difficulty arises from the fact that it is fundamentally
context-sensitive. In particular, the context needed is the number of hashes.
To prove that Rust's string literals are not context-free, we will use
the fact that context-free languages are closed under intersection with
regular languages, and the
[pumping lemma for context-free languages](https://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages).
Consider the regular language `R = r#+""#*"#+`. If Rust's raw string literals are
context-free, then their intersection with `R`, `R'`, should also be context-free.
Therefore, to prove that raw string literals are not context-free,
it is sufficient to prove that `R'` is not context-free.
The language `R'` is `{r#^n""#^m"#^n | m < n}`.
Assume `R'` *is* context-free. Then `R'` has some pumping length `p > 0` for which
the pumping lemma applies. Consider the following string `s` in `R'`:
`r#^p""#^{p-1}"#^p`
e.g. for `p = 2`: `s = r##""#"##`
Then `s = uvwxy` for some choice of `uvwxy` such that `vx` is non-empty,
`|vwx| < p+1`, and `uv^iwx^iy` is in `R'` for all `i >= 0`.
Neither `v` nor `x` can contain a `"` or `r`, as the number of these characters
in any string in `R'` is fixed. So `v` and `x` contain only hashes.
Consequently, of the three sequences of hashes, `v` and `x` combined
can only pump two of them.
If we ever choose the central sequence of hashes, then one of the outer sequences
will not grow when we pump, leading to an imbalance between the outer sequences.
Therefore, we must pump both outer sequences of hashes. However,
there are `p+2` characters between these two sequences of hashes, and `|vwx|` must
be less than `p+1`. Therefore we have a contradiction, and `R'` must not be
context-free.
Since `R'` is not context-free, it follows that the Rust's raw string literals
must not be context-free.

View File

@ -1,66 +0,0 @@
#!/usr/bin/env python
# ignore-tidy-linelength
import sys
import os
import subprocess
import argparse
# usage: testparser.py [-h] [-p PARSER [PARSER ...]] -s SOURCE_DIR
# Parsers should read from stdin and return exit status 0 for a
# successful parse, and nonzero for an unsuccessful parse
parser = argparse.ArgumentParser()
parser.add_argument('-p', '--parser', nargs='+')
parser.add_argument('-s', '--source-dir', nargs=1, required=True)
args = parser.parse_args(sys.argv[1:])
total = 0
ok = {}
bad = {}
for parser in args.parser:
ok[parser] = 0
bad[parser] = []
devnull = open(os.devnull, 'w')
print("\n")
for base, dirs, files in os.walk(args.source_dir[0]):
for f in filter(lambda p: p.endswith('.rs'), files):
p = os.path.join(base, f)
parse_fail = 'parse-fail' in p
if sys.version_info.major == 3:
lines = open(p, encoding='utf-8').readlines()
else:
lines = open(p).readlines()
if any('ignore-test' in line or 'ignore-lexer-test' in line for line in lines):
continue
total += 1
for parser in args.parser:
if subprocess.call(parser, stdin=open(p), stderr=subprocess.STDOUT, stdout=devnull) == 0:
if parse_fail:
bad[parser].append(p)
else:
ok[parser] += 1
else:
if parse_fail:
ok[parser] += 1
else:
bad[parser].append(p)
parser_stats = ', '.join(['{}: {}'.format(parser, ok[parser]) for parser in args.parser])
sys.stdout.write("\033[K\r total: {}, {}, scanned {}"
.format(total, os.path.relpath(parser_stats), os.path.relpath(p)))
devnull.close()
print("\n")
for parser in args.parser:
filename = os.path.basename(parser) + '.bad'
print("writing {} files that did not yield the correct result with {} to {}".format(len(bad[parser]), parser, filename))
with open(filename, "w") as f:
for p in bad[parser]:
f.write(p)
f.write("\n")

View File

@ -1,99 +0,0 @@
enum Token {
SHL = 257, // Parser generators reserve 0-256 for char literals
SHR,
LE,
EQEQ,
NE,
GE,
ANDAND,
OROR,
SHLEQ,
SHREQ,
MINUSEQ,
ANDEQ,
OREQ,
PLUSEQ,
STAREQ,
SLASHEQ,
CARETEQ,
PERCENTEQ,
DOTDOT,
DOTDOTDOT,
MOD_SEP,
LARROW,
RARROW,
FAT_ARROW,
LIT_BYTE,
LIT_CHAR,
LIT_INTEGER,
LIT_FLOAT,
LIT_STR,
LIT_STR_RAW,
LIT_BYTE_STR,
LIT_BYTE_STR_RAW,
IDENT,
UNDERSCORE,
LIFETIME,
// keywords
SELF,
STATIC,
ABSTRACT,
ALIGNOF,
AS,
BECOME,
BREAK,
CATCH,
CRATE,
DEFAULT,
DO,
ELSE,
ENUM,
EXTERN,
FALSE,
FINAL,
FN,
FOR,
IF,
IMPL,
IN,
LET,
LOOP,
MACRO,
MATCH,
MOD,
MOVE,
MUT,
OFFSETOF,
OVERRIDE,
PRIV,
PUB,
PURE,
REF,
RETURN,
SIZEOF,
STRUCT,
SUPER,
UNION,
TRUE,
TRAIT,
TYPE,
UNSAFE,
UNSIZED,
USE,
VIRTUAL,
WHILE,
YIELD,
CONTINUE,
PROC,
BOX,
CONST,
WHERE,
TYPEOF,
INNER_DOC_COMMENT,
OUTER_DOC_COMMENT,
SHEBANG,
SHEBANG_LINE,
STATIC_LIFETIME
};