Updated Application overview (markdown)
@@ -24,7 +24,7 @@ Apart from mentioned data, each token also includes some metadata, like location
|
||||
|
||||
You can check which tokens are produced for arbitrary input code using the `--tokens` flag, for example:
|
||||
```
|
||||
$ smnp --tokens --dry-run -c "[1, 2, 3] as i ^ println(\"Current: \" + i.toString());"
|
||||
size: 21
|
||||
current: 0 -> (open_square, »[«, 1:1)
|
||||
all: [(open_square, »[«, 1:1), (integer, »1«, 1:2), (comma, »,«, 1:3), (integer, »2«, 1:5), (comma, »,«, 1:6), (integer, »3«, 1:8), (close_square, »]«, 1:9), (as, »as«, 1:11), (identifier, »i«, 1:14), (caret, »^«, 1:16), (identifier, »println«, 1:18), (open_paren, »(«, 1:25), (string, »"Current: "«, 1:26), (plus, »+«, 1:38), (identifier, »i«, 1:40), (dot, ».«, 1:41), (identifier, »toString«, 1:42), (open_paren, »(«, 1:50), (close_paren, »)«, 1:51), (close_paren, »)«, 1:52), (semicolon, »;«, 1:53)]
|
||||
@@ -32,11 +32,115 @@ all: [(open_square, »[«, 1:1), (integer, »1«, 1:2), (comma, »,«, 1:3), (in
|
||||
|
||||
The tokenizer tries to match the input against all available patterns, following a first-match rule. That means that if more than one pattern matches the input, only the first one is applied. This is why you can't, for example, use keywords as names of your variables or functions/methods. Take a look at the output of the following command:
|
||||
```
|
||||
$ smnp --tokens --dry-run -c "function = 14;"
|
||||
size: 4
|
||||
current: 0 -> (function, »function«, 1:1)
|
||||
all: [(function, »function«, 1:1), (assign, »=«, 1:10), (integer, »14«, 1:12), (semicolon, »;«, 1:14)]
|
||||
```
|
||||
The first token has the type `function`, not `identifier`, which is what an assignment operation expects.
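
The first-match rule can be illustrated with a minimal sketch. The `Token` class, the `patterns` list and the `tokenize` function below are hypothetical simplifications written for this page, not the actual code from `io.smnp.dsl.token`. The only point is that the keyword pattern is listed before the identifier pattern, so it wins whenever both match:

```kotlin
// Hypothetical token model; the real SMNP classes may look different.
data class Token(val type: String, val text: String)

// Order matters: the keyword pattern comes before the identifier pattern.
val patterns: List<Pair<String, Regex>> = listOf(
    "function" to Regex("function\\b"),
    "identifier" to Regex("[A-Za-z_][A-Za-z0-9_]*"),
    "integer" to Regex("[0-9]+"),
    "assign" to Regex("="),
    "semicolon" to Regex(";")
)

fun tokenize(source: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var position = 0
    while (position < source.length) {
        if (source[position].isWhitespace()) {
            position++
            continue
        }
        // First-match rule: use the first pattern that matches exactly at the current position.
        val token = patterns.firstNotNullOfOrNull { (type, regex) ->
            val match = regex.find(source, position)
            if (match != null && match.range.first == position) Token(type, match.value) else null
        } ?: throw IllegalArgumentException("Unknown character at index $position")
        tokens.add(token)
        position += token.text.length
    }
    return tokens
}
```

With this ordering, `tokenize("function = 14;")` yields a `function` token first, while an input starting with `functional` still yields an `identifier`, because the keyword pattern requires a word boundary.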
|
||||
|
||||
All tokenizer-related code is located in `io.smnp.dsl.token` module.
|
||||
|
||||
# Parser
|
||||
The parser is the next stage of the code processing pipeline.
It takes the tokenizer's output and tries to compose a tree (called an
AST, which stands for **a**bstract **s**yntax **t**ree) based on known rules, which are called *productions*.
While the tokenizer defines the language's *alphabet*, i.e. the set
of available terminals, the parser defines the *grammar* of that language.
This means that the tokenizer can, for example, detect an unknown character
or sequence of characters, whereas the parser is able to detect unknown
constructions built from known tokens.
|
||||
|
||||
A good example is the last snippet from the [[Application overview#Tokenizer]] section:
|
||||
```
|
||||
$ smnp --tokens --dry-run -c "function = 14;"
|
||||
size: 4
|
||||
current: 0 -> (function, »function«, 1:1)
|
||||
all: [(function, »function«, 1:1), (assign, »=«, 1:10), (integer, »14«, 1:12), (semicolon, »;«, 1:14)]
|
||||
Syntax error
|
||||
Source: <inline>
|
||||
Position: line 1, column 10
|
||||
|
||||
Expected function/method name, got '='
|
||||
```
|
||||
You can see that the tokenizer has successfully done its job,
but the parser threw a syntax error, saying that it does not know
any production that could (directly or indirectly) match
the `function assign integer semicolon` sequence.
|
||||
|
||||
You can check the AST produced for arbitrary input code
using the `--ast` flag, for example:
|
||||
```
smnp --ast --dry-run -c "[1, 2, 3] as i ^ println(\"Current: \" + i.toString());"
RootNode 1:16
└─LoopNode 1:16
  ├─ListNode 1:1
  │ ├─IntegerLiteralNode 1:2
  │ │ └ (integer, »1«, 1:2)
  │ ├─IntegerLiteralNode 1:5
  │ │ └ (integer, »2«, 1:5)
  │ └─IntegerLiteralNode 1:8
  │   └ (integer, »3«, 1:8)
  ├─LoopParametersNode 1:14
  │ └─IdentifierNode 1:14
  │   └ (identifier, »i«, 1:14)
  ├─FunctionCallNode 1:18
  │ ├─IdentifierNode 1:18
  │ │ └ (identifier, »println«, 1:18)
  │ └─FunctionCallArgumentsNode 1:25
  │   └─SumOperatorNode 1:38
  │     ├─StringLiteralNode 1:26
  │     │ └ (string, »"Current: "«, 1:26)
  │     ├─TokenNode 1:38
  │     │ └ (plus, »+«, 1:38)
  │     └─AccessOperatorNode 1:41
  │       ├─IdentifierNode 1:40
  │       │ └ (identifier, »i«, 1:40)
  │       ├─TokenNode 1:41
  │       │ └ (dot, ».«, 1:41)
  │       └─FunctionCallNode 1:42
  │         ├─IdentifierNode 1:42
  │         │ └ (identifier, »toString«, 1:42)
  │         └─FunctionCallArgumentsNode 1:50
  └─NoneNode 0:0
```
|
||||
|
||||
Technically, SMNP implements an **LL(1)** parser.
The acronym means:
* input is read from **L**eft to right
* parser produces a **L**eftmost derivation
* parser uses one lookahead token.
Even though this kind of parser is considered the least sophisticated, in most cases
it does the job and is sufficient even for more advanced use cases.
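
To make the "one lookahead token" idea more concrete, here is a minimal sketch of a predictive recursive-descent parser. The grammar, the `MiniListParser` class and its string-based tokens are made up for this illustration and are not part of SMNP. Every decision in the parser is taken by inspecting only the single next token:

```kotlin
// Hypothetical mini-grammar, not SMNP's real one:
//   list := '[' ( item ( ',' item )* )? ']'
//   item := INTEGER | list
class MiniListParser(private val tokens: List<String>) {
    private var position = 0

    // The single lookahead token: the parser never inspects anything beyond it.
    private val lookahead: String?
        get() = tokens.getOrNull(position)

    // Consumes the lookahead token if it is the expected one, otherwise reports an error.
    private fun expect(text: String) {
        require(lookahead == text) { "Expected '$text', got '$lookahead' at token $position" }
        position++
    }

    fun parseList(): List<Any> {
        expect("[")
        val items = mutableListOf<Any>()
        if (lookahead != "]") {
            items.add(parseItem())
            while (lookahead == ",") {
                expect(",")
                items.add(parseItem())
            }
        }
        expect("]")
        return items
    }

    // The lookahead alone decides which production to use: a nested list or an integer.
    private fun parseItem(): Any =
        if (lookahead == "[") {
            parseList()
        } else {
            val text = lookahead ?: throw IllegalArgumentException("Unexpected end of input")
            position++
            text.toIntOrNull() ?: throw IllegalArgumentException("Expected integer, got '$text'")
        }
}
```

For example, `MiniListParser(listOf("[", "1", ",", "[", "2", "]", "]")).parseList()` returns a list holding `1` and a nested list holding `2`, while a token sequence with a missing comma is rejected with an exception.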
|
||||
|
||||
The SMNP language parser has a few fundamental helper functions that provide
building blocks used to implement the actual production
rules. The SMNP language parser is in fact a combination
of sub-parsers, each of which is able to parse a subset of the language.
|
||||
|
||||
For example, `io.smnp.dsl.ast.parser.AtomParser` defines a parser for
atomic values, like literals and so on (note also that an *expression* enclosed
in parentheses is treated as an atom):
|
||||
```kotlin
class AtomParser : Parser() {
    override fun tryToParse(input: TokenList): ParserOutput {
        // An expression wrapped in parentheses; only the inner expression node is kept
        val parenthesesParser = allOf(
            terminal(TokenType.OPEN_PAREN),
            ExpressionParser(),
            terminal(TokenType.CLOSE_PAREN)
        ) { (_, expression) -> expression }

        // Alternatives are tried in order; the first sub-parser that succeeds wins
        val literalParser = oneOf(
            parenthesesParser,
            ComplexIdentifierParser(),
            StaffParser(),
            ListParser(),
            LiteralParser(),
            MapParser()
        )

        return literalParser.parse(input)
    }
}
```
|
||||
In this example you can see both the `allOf()` and `oneOf()` helper methods. The first one returns success (and a parsed node) if and only if all of its subparsers return success as well. In contrast, the `oneOf()` method returns success with a parsed node when any of its subparsers returns success. It looks for the first subparser that succeeds; once it finds one, it immediately returns success with the node produced by that subparser and does not execute the remaining ones. Because the `oneOf()` method is only a proxy for other parsers, it does not need to do anything with the returned nodes. The `allOf()` method, on the other hand, has to compose the nodes returned by its subparsers into a new node. Thanks to that, we can obtain an AST instead of a CST (concrete syntax tree).
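
The behaviour of both helpers can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not SMNP's implementation: `Node`, `Output` and the function-typed `Parser` alias are hypothetical stand-ins for the real `Parser`, `TokenList` and `ParserOutput` classes, and tokens are plain strings.

```kotlin
// Hypothetical, simplified types; SMNP's real classes differ.
data class Node(val label: String, val children: List<Node> = emptyList())

sealed class Output {
    data class Ok(val node: Node, val rest: List<String>) : Output()
    object Fail : Output()
}

typealias Parser = (List<String>) -> Output

// Succeeds when the next token equals the expected text and consumes it.
fun terminal(expected: String): Parser = { tokens ->
    if (tokens.firstOrNull() == expected) Output.Ok(Node(expected), tokens.drop(1))
    else Output.Fail
}

// Runs every sub-parser in sequence and fails as soon as one of them fails.
// On success, the collected nodes are composed into a single new node.
fun allOf(vararg parsers: Parser, compose: (List<Node>) -> Node): Parser =
    fun(tokens: List<String>): Output {
        val nodes = mutableListOf<Node>()
        var rest = tokens
        for (parser in parsers) {
            val result = parser(rest)
            if (result !is Output.Ok) return Output.Fail
            nodes.add(result.node)
            rest = result.rest
        }
        return Output.Ok(compose(nodes), rest)
    }

// Returns the result of the first sub-parser that succeeds; later ones are not executed.
fun oneOf(vararg parsers: Parser): Parser =
    fun(tokens: List<String>): Output {
        for (parser in parsers) {
            val result = parser(tokens)
            if (result is Output.Ok) return result
        }
        return Output.Fail
    }

// Usage: a parenthesized digit is reduced to the digit node itself,
// so the parentheses do not appear in the resulting tree.
val digit: Parser = oneOf(terminal("1"), terminal("2"))
val parenthesized: Parser = allOf(terminal("("), digit, terminal(")")) { (_, inner) -> inner }
val atom: Parser = oneOf(parenthesized, digit)
```

Here the `compose` lambda plays the role of the trailing block in `AtomParser`: it discards the parenthesis nodes, which is what turns a concrete syntax tree into an abstract one.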
|
||||