Logos Handbook
Hi there!
Logos is a fast and easy-to-use lexer generator written in Rust. While Rust has excellent documentation tools (and you can access the API docs for Logos at docs.rs), it's not easy to document the custom syntax used by procedural macros, and Logos uses quite a bit of it. This Handbook seeks to remedy that!
In a nutshell
There are two main types in Logos:
- The `Logos` trait, which comes with its own derive macro. The derive macro uses custom attributes (the things using these brackets: `#[...]`) with plain string or regular expression syntax on `enum` variants, as patterns for some input.
- The `Lexer<T: Logos>`, an iterator that takes some input (`&str`, sometimes `&[u8]`) and performs lexical analysis on the fly, producing variants of the enum `T` matching the defined patterns.
Getting Started
Logos can be included in your Rust project using the `cargo add logos` command, or by directly modifying your `Cargo.toml` file:
[dependencies]
logos = "0.15.0"
Then, you can automatically derive the `Logos` trait on your `enum` using the `Logos` derive macro:
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")] // Ignore this regex pattern between tokens
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,
}
Then, you can use the `Logos::lexer` method to turn any `&str` into an iterator of tokens:
let mut lex = Token::lexer("Create ridiculously fast Lexers.");

assert_eq!(lex.next(), Some(Ok(Token::Text)));
assert_eq!(lex.span(), 0..6);
assert_eq!(lex.slice(), "Create");

assert_eq!(lex.next(), Some(Ok(Token::Text)));
assert_eq!(lex.span(), 7..19);
assert_eq!(lex.slice(), "ridiculously");

assert_eq!(lex.next(), Some(Ok(Token::Fast)));
assert_eq!(lex.span(), 20..24);
assert_eq!(lex.slice(), "fast");

assert_eq!(lex.next(), Some(Ok(Token::Text)));
assert_eq!(lex.slice(), "Lexers");
assert_eq!(lex.span(), 25..31);

assert_eq!(lex.next(), Some(Ok(Token::Period)));
assert_eq!(lex.span(), 31..32);
assert_eq!(lex.slice(), ".");

assert_eq!(lex.next(), None);
Each item is actually a `Result<Token, _>`, because the lexer returns an error if some part of the string slice does not match any variant of `Token`.
Because `Lexer`, returned by `Logos::lexer`, implements the `Iterator` trait, you can use a `for .. in` construct:
for result in Token::lexer("Create ridiculously fast Lexers.") {
    match result {
        Ok(token) => println!("{:#?}", token),
        Err(e) => panic!("some error occurred: {:?}", e),
    }
}
Getting Help
If you need help using Logos, there are three places you can go to, depending on what you are looking for:
- this book, for a documented walk-through of Logos' usage, with detailed examples, and more. A must-read for any newcomer;
- the API documentation, to obtain precise information about function signatures and the features the Logos crate exposes;
- and GitHub issues, for anything else not covered by the two above.
Regarding GitHub issues, it's highly recommended to first check whether another issue, either open or closed, already covers the topic you are looking for. If not, consider creating a new issue with the necessary information about your question or problem.
Attributes
The `#[derive(Logos)]` procedural macro recognizes three different attribute names.
- `#[logos]` is the main attribute, which can be attached to the `enum` of your token definition. It allows you to define the `Extras` associated type in order to put custom state into the `Lexer`, or to declare concrete types for generic type parameters, if your `enum` uses any. It is strictly optional. It also lets you define parts that must be skipped by the lexer, the error type, or regex subpatterns.
- And most importantly, the `#[token]` and `#[regex]` attributes. These allow you to define patterns to match against the input, either plain text strings with `#[token]`, or regular expression syntax with `#[regex]`. Aside from that difference, they are equivalent, and any extra argument you can pass to one, you can pass to the other.
#[logos]
As previously said, the `#[logos]` attribute can be attached to the `enum` of your token definition to customize your lexer. Note that all of its options are optional.
The syntax is as follows:
#[derive(Logos)]
#[logos(skip "regex literal")]
#[logos(extras = ExtrasType)]
#[logos(error = ErrorType)]
#[logos(crate = path::to::logos)]
#[logos(source = SourceType)]
enum Token { /* ... */ }
where "regex literal"
can be any regex supported by
#[regex]
, and ExtrasType
can be of any type!
An example usage of skip
is provided in the JSON parser example.
For more details about extras, read the eponym section.
Custom error type
By default, Logos uses `()` as the error type, which means that it doesn't store any information about the error. This can be changed by using the `#[logos(error = ErrorType)]` attribute on the enum.
The type `ErrorType` can be any type that implements `Clone`, `PartialEq`, `Default`, and `From<E>` for each callback's error type `E`.
`ErrorType` must implement the `Default` trait because invalid tokens, i.e., literals that do not match any variant, will produce `Err(ErrorType::default())`.
Here is an example using a custom error type:
use logos::Logos;
use std::num::ParseIntError;
#[derive(Default, Debug, Clone, PartialEq)]
enum LexingError {
InvalidInteger(String),
#[default]
NonAsciiCharacter,
}
/// Converts a `ParseIntError` from `lex.slice().parse()` to `u8` into a `LexingError`.
impl From<ParseIntError> for LexingError {
fn from(err: ParseIntError) -> Self {
use std::num::IntErrorKind::*;
match err.kind() {
PosOverflow | NegOverflow => LexingError::InvalidInteger("overflow error".to_owned()),
_ => LexingError::InvalidInteger("other error".to_owned()),
}
}
}
#[derive(Debug, Logos, PartialEq)]
#[logos(error = LexingError)]
#[logos(skip r"[ \t]+")]
enum Token {
#[regex(r"[a-zA-Z]+")]
Word,
#[regex(r"[0-9]+", |lex| lex.slice().parse())]
Integer(u8),
}
fn main() {
// 256 overflows u8, since u8's max value is 255.
// 'é' is not a valid ascii letter.
let mut lex = Token::lexer("Hello 256 Jérome");
assert_eq!(lex.next(), Some(Ok(Token::Word)));
assert_eq!(lex.slice(), "Hello");
assert_eq!(
lex.next(),
Some(Err(LexingError::InvalidInteger(
"overflow error".to_owned()
)))
);
assert_eq!(lex.slice(), "256");
assert_eq!(lex.next(), Some(Ok(Token::Word)));
assert_eq!(lex.slice(), "J");
assert_eq!(lex.next(), Some(Err(LexingError::NonAsciiCharacter)));
assert_eq!(lex.slice(), "é");
assert_eq!(lex.next(), Some(Ok(Token::Word)));
assert_eq!(lex.slice(), "rome");
assert_eq!(lex.next(), None);
}
You can add error variants to `LexingError`, and implement `From<E>` for each error type `E` that could be returned by a callback. See callbacks.
Specifying path to logos
You can force the derive macro to use a different path to the `Logos` crate with `#[logos(crate = path::to::logos)]`.
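For instance, if the `logos` crate were only reachable through a re-export, a minimal sketch (the `my_deps` re-export name is hypothetical) could look like this:

```rust
// Hypothetical setup: the `logos` crate is only reachable
// through a re-export, e.g. `my_deps::logos`.
#[derive(my_deps::logos::Logos, Debug, PartialEq)]
#[logos(crate = my_deps::logos)]
enum Token {
    #[token("+")]
    Plus,
}
```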
Custom source type
By default, Logos's lexer will accept `&str` as input, unless any of the pattern literals match a non-UTF-8 byte sequence. In that case, it will fall back to `&[u8]`. You can override this behavior by forcing one of the two source types. You can also specify any custom type that implements `Source`.
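As a minimal sketch, assuming the attribute accepts the unsized slice type `[u8]` directly (the `Magic` token is purely illustrative), forcing a byte-slice source even though every pattern is valid UTF-8 could look like this:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(source = [u8])] // force `&[u8]` input instead of `&str`
enum Token {
    #[token("magic")]
    Magic,
}

fn main() {
    let mut lex = Token::lexer(b"magic");
    assert_eq!(lex.next(), Some(Ok(Token::Magic)));
}
```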
`#[token]` and `#[regex]`
For each variant you declare in an `enum` that uses the `Logos` derive macro, you can specify one or more string literals or regexes it can match.
The usage syntax is as follows:
#[derive(Logos)]
enum Token {
    #[token(literal [, callback, priority = <integer>, ignore(<flag>, ...)])]
    #[regex(literal [, callback, priority = <integer>, ignore(<flag>, ...)])]
    SomeVariant,
}
where `literal` can be any `&str` or `&[u8]` string literal, `callback` can either be a closure or a literal path to a function (see the Using callbacks section), `priority` can be any positive integer (see the Token disambiguation section), and `flag` can be one of: `case`, `ascii_case`. Only `literal` is required; the others are optional.
You can stack any number of `#[token]` and/or `#[regex]` attributes on top of the same variant.
For a list of supported regex literals, read the Common regular expressions section.
Token disambiguation
When two or more tokens can match a given sequence, Logos computes the priority of each pattern (`#[token]` or `#[regex]`) and uses that priority to decide which pattern should match.
The rule of thumb is:
- Longer beats shorter.
- Specific beats generic.
If any two definitions could match the same input, like `fast` and `[a-zA-Z]+` in the example above, it's the longer and more specific definition of `Token::Fast` that will be the result.
This is done by comparing the numeric priority attached to each definition. Every consecutive, non-repeating single byte adds 2 to the priority, while every range or regex class adds 1. Loops or optional blocks are ignored, while alternations count the shortest alternative:
- `[a-zA-Z]+` has a priority of 2 (the lowest possible), because at minimum it can match a single byte to a class;
- `foobar` has a priority of 12;
- and `(foo|hello)(bar)?` has a priority of 6, `foo` being its shortest possible match.
Generally speaking, equivalent regex patterns have the same priority. E.g., `a|b` is equivalent to `[a-b]`, and both have a priority of 2.
When two different patterns have the same priority, Logos will issue a compilation error. To prevent this from happening, you can manually set the priority of a given pattern with, e.g., `#[token("foobar", priority = 20)]`.
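For example, here is a small sketch (the `Word` and `Alphanumeric` variants are purely illustrative, not from the examples above) where two regexes would otherwise compute the same priority and conflict, and a manual priority breaks the tie:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    // Both regexes have a computed priority of 2 and can match the
    // same input (e.g. "abc"), so we bump this one to break the tie.
    #[regex("[a-z]+", priority = 3)]
    Word,

    #[regex("[a-z0-9]+")]
    Alphanumeric,
}

fn main() {
    let mut lex = Token::lexer("abc a1b2");
    assert_eq!(lex.next(), Some(Ok(Token::Word)));
    // "a1b2" is only fully matched by the alphanumeric pattern,
    // and the longer match wins.
    assert_eq!(lex.next(), Some(Ok(Token::Alphanumeric)));
}
```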
Using Extras
When deriving the `Logos` trait, you may want to convey some internal state between your tokens. That is where `Logos::Extras` comes to the rescue.
Each `Lexer` has a public field called `extras` that can be accessed and mutated to keep track of and modify some internal state. By default, this field is set to `()`, but its type can be modified using the derive attribute `#[logos(extras = <some type>)]` on your `enum` declaration.
For example, one may want to know the location, both line and column indices, of each token. This is especially useful when one needs to report an erroneous token to the user, in a user-friendly manner.
/// Simple tokens to retrieve words and their location.
#[derive(Debug, Logos)]
#[logos(extras = (usize, usize))]
enum Token {
#[regex(r"\n", newline_callback)]
Newline,
#[regex(r"\w+", word_callback)]
Word((usize, usize)),
}
The above token definition will hold two tokens: `Newline` and `Word`. The former is only used to keep track of the line numbering and will be skipped using `Skip` as a return value from its callback function. The latter will be a word with `(line, column)` indices.
To make it easy, the lexer will contain the following two extras:
- `extras.0`: the line number;
- `extras.1`: the char index of the current line.
We now have to define the two callback functions:
/// Update the line count and the char index.
fn newline_callback(lex: &mut Lexer<Token>) -> Skip {
lex.extras.0 += 1;
lex.extras.1 = lex.span().end;
Skip
}
/// Compute the line and column position for the current word.
fn word_callback(lex: &mut Lexer<Token>) -> (usize, usize) {
let line = lex.extras.0;
let column = lex.span().start - lex.extras.1;
(line, column)
}
Extras can of course be used for more complicated logic, and there is no limit to what you can store within the public `extras` field.
Finally, we provide the full code, which you should be able to run with1:
cargo run --example extras Cargo.toml
1 You first need to clone this repository.
use logos::{Lexer, Logos, Skip};
use std::env;
use std::fs;
/// Update the line count and the char index.
fn newline_callback(lex: &mut Lexer<Token>) -> Skip {
lex.extras.0 += 1;
lex.extras.1 = lex.span().end;
Skip
}
/// Compute the line and column position for the current word.
fn word_callback(lex: &mut Lexer<Token>) -> (usize, usize) {
let line = lex.extras.0;
let column = lex.span().start - lex.extras.1;
(line, column)
}
/// Simple tokens to retrieve words and their location.
#[derive(Debug, Logos)]
#[logos(extras = (usize, usize))]
enum Token {
#[regex(r"\n", newline_callback)]
Newline,
#[regex(r"\w+", word_callback)]
Word((usize, usize)),
}
fn main() {
let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
.expect("Failed to read file");
let mut lex = Token::lexer(src.as_str());
while let Some(token) = lex.next() {
if let Ok(Token::Word((line, column))) = token {
println!("Word '{}' found at ({}, {})", lex.slice(), line, column);
}
}
}
Using callbacks
Logos can also call arbitrary functions whenever a pattern is matched, which can be used to put data into a variant:
use logos::{Logos, Lexer};

// Note: callbacks can return `Option` or `Result`
fn kilo(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k'
    Some(n * 1_000)
}

fn mega(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm'
    Some(n * 1_000_000)
}

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    // Callbacks can use closure syntax, or refer
    // to a function defined elsewhere.
    //
    // Each pattern can have its own callback.
    #[regex("[0-9]+", |lex| lex.slice().parse().ok())]
    #[regex("[0-9]+k", kilo)]
    #[regex("[0-9]+m", mega)]
    Number(u64),
}

fn main() {
    let mut lex = Token::lexer("5 42k 75m");

    assert_eq!(lex.next(), Some(Ok(Token::Number(5))));
    assert_eq!(lex.slice(), "5");

    assert_eq!(lex.next(), Some(Ok(Token::Number(42_000))));
    assert_eq!(lex.slice(), "42k");

    assert_eq!(lex.next(), Some(Ok(Token::Number(75_000_000))));
    assert_eq!(lex.slice(), "75m");

    assert_eq!(lex.next(), None);
}
Logos can handle callbacks with the following return types:
| Return type | Produces |
|---|---|
| `()` | `Ok(Token::Unit)` |
| `bool` | `Ok(Token::Unit)` or `Err(<Token as Logos>::Error::default())` |
| `Result<(), E>` | `Ok(Token::Unit)` or `Err(<Token as Logos>::Error::from(err))` |
| `T` | `Ok(Token::Value(T))` |
| `Option<T>` | `Ok(Token::Value(T))` or `Err(<Token as Logos>::Error::default())` |
| `Result<T, E>` | `Ok(Token::Value(T))` or `Err(<Token as Logos>::Error::from(err))` |
| `Skip` | skips matched input |
| `Result<Skip, E>` | skips matched input or `Err(<Token as Logos>::Error::from(err))` |
| `Filter<T>` | `Ok(Token::Value(T))` or skips matched input |
| `FilterResult<T, E>` | `Ok(Token::Value(T))` or `Err(<Token as Logos>::Error::from(err))` or skips matched input |
Callbacks can also be used to perform more specialized lexing in places where regular expressions are too limiting. For specifics, look at `Lexer::remainder` and `Lexer::bump`.
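As a rough sketch of that approach (the `block_comment` callback and `BlockComment` token are illustrative, not from the handbook's examples), a callback can inspect `Lexer::remainder` and advance the lexer manually with `Lexer::bump`, here to consume a block comment terminated by `*/`:

```rust
use logos::{Lexer, Logos};

// Consume everything up to and including the closing `*/`.
// Returning `false` produces an error if the comment is never terminated.
fn block_comment(lex: &mut Lexer<Token>) -> bool {
    if let Some(end) = lex.remainder().find("*/") {
        lex.bump(end + 2); // also consume the `*/` itself
        true
    } else {
        lex.bump(lex.remainder().len()); // consume the rest of the input
        false
    }
}

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    // The pattern only matches the opening `/*`; the callback
    // takes over from there using `remainder` and `bump`.
    #[token("/*", block_comment)]
    BlockComment,

    #[regex("[a-zA-Z]+")]
    Word,
}

fn main() {
    let mut lex = Token::lexer("hello /* skip me */ world");
    assert_eq!(lex.next(), Some(Ok(Token::Word)));
    assert_eq!(lex.next(), Some(Ok(Token::BlockComment)));
    assert_eq!(lex.next(), Some(Ok(Token::Word)));
    assert_eq!(lex.next(), None);
}
```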
Common regular expressions
Maybe the most important feature of Logos is its ability to accept regex patterns in your tokens' definition.
Regular expressions,
or regexes for short, are sequences of characters (or bytes) that define a match
pattern. When constructing lexers, this is especially useful to define tokens
that should match a set of similar literals. E.g., a sequence of
3 ASCII uppercase letters and 3 digits could define a license plate,
and could be matched with the following regex: "[A-Z]{3}[0-9]{3}"
.
For more details about regexes in Rust, refer to the regex crate.
Valid regexes that are not supported
Because Logos aims at generating high-performance code, it never allows backtracking. This means that once a byte has been read from the input source, it will never be read again. This implementation choice comes at a cost: not all valid regexes are supported by Logos1.
For reference, Logos parses regexes using `regex-syntax = 0.6`, and transforms its high-level intermediate representation (HIR) into its own medium intermediate representation (MIR). Coming from HIR, MIR does not support the following `HirKind`s:
- Non-greedy repetitions, i.e., matching as little as possible of a given pattern.
- `".*"` and `".+"` repetition patterns, because they would potentially consume all of the input source, breaking the non-backtracking rule. For solutions, see footnote 1 or read the error message.
- Word boundaries, i.e., `r"\b"`.
- Anchors, because the input source does not treat lines separately.
Additionally, note that capture groups will silently be ungrouped, because Logos does not support capturing groups, only the main slice (`lex.slice()`).
Most of the time, however, it is possible to circumvent this issue by rewriting your regex another way, or by using callbacks. E.g., see #302.
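For instance, a minimal sketch of that kind of rewrite (the `QuotedString` token is illustrative): instead of a pattern like `"\".*\""` for quoted strings, an explicit character class avoids the unbounded `.*` repetition entirely:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    // A `"\".*\""`-style pattern would hit the `.*` limitation;
    // this matches a quoted string without needing to backtrack
    // over the closing quote.
    #[regex(r#""[^"]*""#)]
    QuotedString,
}

fn main() {
    let mut lex = Token::lexer(r#""hello" "world""#);
    assert_eq!(lex.next(), Some(Ok(Token::QuotedString)));
    assert_eq!(lex.slice(), r#""hello""#);
    assert_eq!(lex.next(), Some(Ok(Token::QuotedString)));
    assert_eq!(lex.next(), None);
}
```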
Other issues
Logos' support for regexes is not yet complete, and errors can still exist. Some are found at compile time, and others will create wrong matches or panic.
If you ever feel like your patterns do not match the expected source slices, please check the GitHub issues. If no issue covers your problem, we encourage you to create a new issue, and document it as best as you can so that the issue can be reproduced locally.
Debugging
Instructions on how to debug your Logos lexer.
Visualizing Logos Graph
Logos works by creating a graph that gets derived from the tokens that you defined. This graph describes how the lexer moves through different states when processing input.
Hence, it may be beneficial during debugging to be able to visualize this graph, to understand how Logos will match the various tokens.
If we take this example:
use logos::Logos;
#[derive(Debug, Logos, PartialEq)]
enum Token {
// Tokens can be literal strings, of any length.
#[token("fast")]
Fast,
#[token(".")]
Period,
// Or regular expressions.
#[regex("[a-zA-Z]+")]
Text,
}
fn main() {
let input = "Create ridiculously fast Lexers.";
let mut lexer = Token::lexer(input);
while let Some(token) = lexer.next() {
println!("{:?}", token);
}
}
Logos actually constructs a graph that contains the logic for matching tokens:
graph = {
1: ::Fast,
2: ::Period,
3: ::Text,
4: {
[A-Z] ⇒ 4,
[a-z] ⇒ 4,
_ ⇒ 3,
},
7: [
ast ⇒ 8,
_ ⇒ 4*,
],
8: {
[A-Z] ⇒ 4,
[a-z] ⇒ 4,
_ ⇒ 1,
},
9: {
. ⇒ 2,
[A-Z] ⇒ 4,
[a-e] ⇒ 4,
f ⇒ 7,
[g-z] ⇒ 4,
},
}
This graph can help us understand how our patterns are matched, and maybe understand why we have a bug at some point.
Let's get started by trying to understand how Logos matches the `.` character, which we've tokenized as `Token::Period`.
We can begin our search by looking at node `9` for the character `.`.
We can see that if Logos matches a `.`, it will jump (`⇒`) to node `2`. We can then follow that by looking at `2`, which resolves to our `::Period` token.
Logos will then continue to look for any matches past our `.` character. This is required in case there is a potential continuation after the `.` character. In the input we provided, however, there are no additional characters, since it is the end of our input.
We can also try to identify how the token `fast` works by looking at `9` first, and seeing that `f` will cause Logos to jump to `7`. This will then resolve the last letters of our word fast by matching `ast`, which jumps to `8`. Since our provided input to the lexer does not include alphabetic characters after the word "fast", but rather a whitespace, the token `::Fast` will be recognized.
Then, the graph will look for further potential continuation (here, `[g-z] ⇒ 4`).
Enabling
To enable debug output, you can enable the `debug` feature of Logos in your `Cargo.toml` file, like this:
# Cargo.toml
[dependencies]
logos = { version = "0.15.0", features = ["debug"] }
Next, you can build your project with `cargo build`, and the output will contain a debug representation of your graph(s).
Unsafe Code
By default, Logos uses unsafe code to avoid unnecessary bounds checks while accessing slices of the input `Source`.
This unsafe code also exists in the code generated by the `Logos` derive macro, which generates a deterministic finite automaton (DFA). Reasoning about the correctness of this generated code can be difficult: if the derivation of the DFA in Logos is correct, then the generated code will be correct, and any mistakes in the implementation would be caught given sufficient fuzz testing.
Use of unsafe code is the default as this typically provides the fastest parser.
Disabling Unsafe Code
However, for applications accepting untrusted input in a trusted context, this may not be a sufficient correctness justification.
For those applications which cannot tolerate unsafe code, the feature `forbid-unsafe` may be enabled. This replaces unchecked accesses in the `Logos` crate with safe, checked alternatives which will panic on out-of-bounds access rather than cause undefined behavior. Additionally, code generated by the macro will not use the `unsafe` keyword, so the generated code may be used in crates using the `#![forbid(unsafe_code)]` attribute.
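For example, a Cargo.toml sketch enabling the feature (the version number matches the one used earlier in this book and may need updating) could look like this:

```toml
[dependencies]
logos = { version = "0.15.0", features = ["forbid-unsafe"] }
```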
When the `forbid-unsafe` feature is added to a direct dependency on the `Logos` crate, Feature Unification ensures that any transitive inclusion of `Logos` via other dependencies also has unsafe code disabled.
Generally, disabling unsafe code will result in a slower parser.
However, making definitive statements around the performance of safe-only code is difficult, as there are too many variables to consider between compiler optimizations, the specific grammar being parsed, and the target processor. The automated benchmarks of this crate show around a 10% slowdown in safe-only code at the time of this writing.
Examples
The following examples are ordered by increasing level of complexity.
Brainfuck interpreter: Lexers are very powerful tools for parsing program code into meaningful instructions. We show you how you can build an interpreter for the Brainfuck programming language in under 100 lines of code!
Simple calculator: For a relatively large domain-specific language (DSL), or any programming language, implementing an interpreter typically involves converting the tokens generated by a lexer into an abstract syntax tree (AST) via a parser, and then evaluating it. We show you how you can build a simple calculator that evaluates arithmetic expressions by combining Logos and a parser generator library.
JSON parser: We present a JSON parser written with Logos that does nice error reporting when invalid values are encountered.
JSON-borrowed parser: A variant of the previous parser, but that does not own its data.
Brainfuck interpreter
In most programming languages, commands can be made of multiple program tokens, where a token is simply a string slice that has a particular meaning for the language. For example, in Rust, the function signature `pub fn main()` could be split by the lexer into the tokens `pub`, `fn`, `main`, `(`, and `)`. Then, the parser combines tokens into meaningful program instructions.
However, there exist programming languages so simple, such as Brainfuck, that each token can be mapped to a single instruction. There are actually 8 single-character tokens:
/// Each [`Op`] variant is a single character.
#[derive(Debug, Logos)]
enum Op {
/// Increment pointer.
#[token(">")]
IncPointer,
/// Decrement pointer.
#[token("<")]
DecPointer,
/// Increment data at pointer.
#[token("+")]
IncData,
/// Decrement data at pointer.
#[token("-")]
DecData,
/// Output data at pointer.
#[token(".")]
OutData,
/// Input (read) to data at pointer.
#[token(",")]
InpData,
/// Conditionally jump to matching `']'`.
#[token("[")]
CondJumpForward,
/// Conditionally jump to matching `'['`.
#[token("]")]
CondJumpBackward,
}
All other characters must be ignored.
Once the tokens are obtained, a Brainfuck interpreter can be easily created using a finite-state machine. For the sake of simplicity, we collected all the tokens into one vector called `operations`.
Now, creating an interpreter becomes straightforward1:
let mut i: usize = 0;
// True program execution.
loop {
match operations[i] {
Op::IncPointer => pointer += 1,
Op::DecPointer => pointer -= 1,
Op::IncData => data[pointer] = data[pointer].wrapping_add(1),
Op::DecData => data[pointer] = data[pointer].wrapping_sub(1),
Op::OutData => print_byte(data[pointer]),
Op::InpData => data[pointer] = read_byte(),
Op::CondJumpForward => {
if data[pointer] == 0 {
// Skip until matching end.
i = *pairs.get(&i).unwrap();
}
}
Op::CondJumpBackward => {
if data[pointer] != 0 {
// Go back to matching start.
i = *pairs_reverse.get(&i).unwrap();
}
}
}
i += 1;
if i >= len {
break;
}
}
There is a small trick to make it easy. As can be seen in the full code, we first perform a check that all loop beginnings (`'['`) have a matching end (`']'`). This way, we can create two maps, `pairs` and `pairs_reverse`, to easily jump back and forth between them.
Finally, we provide the full code, which you should be able to run with2:
cargo run --example brainfuck examples/hello_word.bf
2 You first need to clone this repository.
use logos::Logos;
use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::{self, Read};
/// Each [`Op`] variant is a single character.
#[derive(Debug, Logos)]
enum Op {
/// Increment pointer.
#[token(">")]
IncPointer,
/// Decrement pointer.
#[token("<")]
DecPointer,
/// Increment data at pointer.
#[token("+")]
IncData,
/// Decrement data at pointer.
#[token("-")]
DecData,
/// Output data at pointer.
#[token(".")]
OutData,
/// Input (read) to data at pointer.
#[token(",")]
InpData,
/// Conditionally jump to matching `']'`.
#[token("[")]
CondJumpForward,
/// Conditionally jump to matching `'['`.
#[token("]")]
CondJumpBackward,
}
/// Print one byte to the terminal.
#[inline(always)]
fn print_byte(byte: u8) {
print!("{}", byte as char);
}
/// Read one byte from the terminal.
#[inline(always)]
fn read_byte() -> u8 {
let mut input = [0u8; 1];
io::stdin()
.read_exact(&mut input)
.expect("An error occurred while reading byte!");
input[0]
}
/// Execute Brainfuck code from a string slice.
pub fn execute(code: &str) {
let operations: Vec<_> = Op::lexer(code).filter_map(|op| op.ok()).collect();
let mut data = [0u8; 30_000]; // Minimum recommended size
let mut pointer: usize = 0;
let len = operations.len();
// We pre-process matching jump commands, and we create
// a mapping between them.
let mut queue = Vec::new();
let mut pairs = HashMap::new();
let mut pairs_reverse = HashMap::new();
for (i, op) in operations.iter().enumerate() {
match op {
Op::CondJumpForward => queue.push(i),
Op::CondJumpBackward => {
if let Some(start) = queue.pop() {
pairs.insert(start, i);
pairs_reverse.insert(i, start);
} else {
panic!(
"Unexpected conditional backward jump at position {}, does not match any '['",
i
);
}
}
_ => (),
}
}
if !queue.is_empty() {
panic!("Unmatched conditional forward jump at positions {:?}, expecting a closing ']' for each of them", queue);
}
let mut i: usize = 0;
// True program execution.
loop {
match operations[i] {
Op::IncPointer => pointer += 1,
Op::DecPointer => pointer -= 1,
Op::IncData => data[pointer] = data[pointer].wrapping_add(1),
Op::DecData => data[pointer] = data[pointer].wrapping_sub(1),
Op::OutData => print_byte(data[pointer]),
Op::InpData => data[pointer] = read_byte(),
Op::CondJumpForward => {
if data[pointer] == 0 {
// Skip until matching end.
i = *pairs.get(&i).unwrap();
}
}
Op::CondJumpBackward => {
if data[pointer] != 0 {
// Go back to matching start.
i = *pairs_reverse.get(&i).unwrap();
}
}
}
i += 1;
if i >= len {
break;
}
}
}
fn main() {
let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
.expect("Failed to read file");
execute(src.as_str());
}
Simple calculator
This page (including the images) was contributed by ynn.
When you implement an interpreter for a domain-specific language (DSL), or any programming language, the process typically involves the following steps:
- Lexing: splitting the input stream (i.e., the source code string) into tokens via a lexer.
- Parsing: converting the tokens into an abstract syntax tree (AST) via a parser.
- Evaluation: evaluating the AST to produce the result.
In this example, we implement a simple calculator that evaluates arithmetic expressions such as `1 + 2 * 3` or `((1 + 2) * 3 + 4) * 2 + 4 / 3`.
We use `logos` as the lexer generator and `chumsky` as the parser generator.
1. Try It
Before diving into the implementation details, let's play with it1.
$ cargo run --example calculator '1 + 7 * (3 - 4) / 2'
1 You first need to clone this repository.
Output:
[AST]
Add(
Int(
1,
),
Div(
Mul(
Int(
7,
),
Sub(
Int(
3,
),
Int(
4,
),
),
),
Int(
2,
),
),
)
[result]
-2
Full Code
use std::env;
use chumsky::prelude::*;
use logos::Logos;
#[derive(Logos, Debug, PartialEq, Eq, Hash, Clone)]
#[logos(skip r"[ \t\n]+")]
#[logos(error = String)]
enum Token {
#[token("+")]
Plus,
#[token("-")]
Minus,
#[token("*")]
Multiply,
#[token("/")]
Divide,
#[token("(")]
LParen,
#[token(")")]
RParen,
#[regex("[0-9]+", |lex| lex.slice().parse::<isize>().unwrap())]
Integer(isize),
}
#[derive(Debug)]
enum Expr {
// Integer literal.
Int(isize),
// Unary minus.
Neg(Box<Expr>),
// Binary operators.
Add(Box<Expr>, Box<Expr>),
Sub(Box<Expr>, Box<Expr>),
Mul(Box<Expr>, Box<Expr>),
Div(Box<Expr>, Box<Expr>),
}
impl Expr {
fn eval(&self) -> isize {
match self {
Expr::Int(n) => *n,
Expr::Neg(rhs) => -rhs.eval(),
Expr::Add(lhs, rhs) => lhs.eval() + rhs.eval(),
Expr::Sub(lhs, rhs) => lhs.eval() - rhs.eval(),
Expr::Mul(lhs, rhs) => lhs.eval() * rhs.eval(),
Expr::Div(lhs, rhs) => lhs.eval() / rhs.eval(),
}
}
}
#[allow(clippy::let_and_return)]
fn parser() -> impl Parser<Token, Expr, Error = Simple<Token>> {
recursive(|p| {
let atom = {
let parenthesized = p
.clone()
.delimited_by(just(Token::LParen), just(Token::RParen));
let integer = select! {
Token::Integer(n) => Expr::Int(n),
};
parenthesized.or(integer)
};
let unary = just(Token::Minus)
.repeated()
.then(atom)
.foldr(|_op, rhs| Expr::Neg(Box::new(rhs)));
let binary_1 = unary
.clone()
.then(
just(Token::Multiply)
.or(just(Token::Divide))
.then(unary)
.repeated(),
)
.foldl(|lhs, (op, rhs)| match op {
Token::Multiply => Expr::Mul(Box::new(lhs), Box::new(rhs)),
Token::Divide => Expr::Div(Box::new(lhs), Box::new(rhs)),
_ => unreachable!(),
});
let binary_2 = binary_1
.clone()
.then(
just(Token::Plus)
.or(just(Token::Minus))
.then(binary_1)
.repeated(),
)
.foldl(|lhs, (op, rhs)| match op {
Token::Plus => Expr::Add(Box::new(lhs), Box::new(rhs)),
Token::Minus => Expr::Sub(Box::new(lhs), Box::new(rhs)),
_ => unreachable!(),
});
binary_2
})
.then_ignore(end())
}
fn main() {
//reads the input expression from the command line
let input = env::args()
.nth(1)
.expect("Expected expression argument (e.g. `1 + 7 * (3 - 4) / 5`)");
//creates a lexer instance from the input
let lexer = Token::lexer(&input);
//splits the input into tokens, using the lexer
let mut tokens = vec![];
for (token, span) in lexer.spanned() {
match token {
Ok(token) => tokens.push(token),
Err(e) => {
println!("lexer error at {:?}: {}", span, e);
return;
}
}
}
//parses the tokens to construct an AST
let ast = match parser().parse(tokens) {
Ok(expr) => {
println!("[AST]\n{:#?}", expr);
expr
}
Err(e) => {
println!("parse error: {:#?}", e);
return;
}
};
//evaluates the AST to get the result
println!("\n[result]\n{}", ast.eval());
}
2. Lexer
Our calculator supports the following tokens:
- Integer literals: `0`, `1`, `15`, etc.;
- Unary operator: `-`;
- Binary operators: `+`, `-`, `*`, `/`;
- Parenthesized expressions: `(3 + 5) * 2`, `((1 + 2) * 3 + 4) * 2 + 3 / 2`, etc.
#[derive(Logos, Debug, PartialEq, Eq, Hash, Clone)]
#[logos(skip r"[ \t\n]+")]
#[logos(error = String)]
enum Token {
#[token("+")]
Plus,
#[token("-")]
Minus,
#[token("*")]
Multiply,
#[token("/")]
Divide,
#[token("(")]
LParen,
#[token(")")]
RParen,
#[regex("[0-9]+", |lex| lex.slice().parse::<isize>().unwrap())]
Integer(isize),
}
3. Parser
While it is easy enough to manually implement a parser in this case (e.g., Pratt parsing), let's just use the `chumsky` crate, which is one of the most popular parser generator libraries in Rust.
3.1 AST Definition
First, we define the AST.
#[derive(Debug)]
enum Expr {
// Integer literal.
Int(isize),
// Unary minus.
Neg(Box<Expr>),
// Binary operators.
Add(Box<Expr>, Box<Expr>),
Sub(Box<Expr>, Box<Expr>),
Mul(Box<Expr>, Box<Expr>),
Div(Box<Expr>, Box<Expr>),
}
Note that:
- We name the enum not `AST` but `Expr`, because an AST is just nested expressions.
- There is no `Parenthesized` variant, because parentheses only affect the order of operations (i.e., precedence), which is reflected in the AST structure.
- `Box` is used because a directly recursive enum is not allowed in Rust.
3.2 Parser Implementation
Next, we define the parser. The code may look a bit complicated if you are not familiar with parser combinator libraries, but it is actually quite simple. See Chumsky's official tutorial for the details.
fn parser() -> impl Parser<Token, Expr, Error = Simple<Token>> {
recursive(|p| {
let atom = {
let parenthesized = p
.clone()
.delimited_by(just(Token::LParen), just(Token::RParen));
let integer = select! {
Token::Integer(n) => Expr::Int(n),
};
parenthesized.or(integer)
};
let unary = just(Token::Minus)
.repeated()
.then(atom)
.foldr(|_op, rhs| Expr::Neg(Box::new(rhs)));
let binary_1 = unary
.clone()
.then(
just(Token::Multiply)
.or(just(Token::Divide))
.then(unary)
.repeated(),
)
.foldl(|lhs, (op, rhs)| match op {
Token::Multiply => Expr::Mul(Box::new(lhs), Box::new(rhs)),
Token::Divide => Expr::Div(Box::new(lhs), Box::new(rhs)),
_ => unreachable!(),
});
let binary_2 = binary_1
.clone()
.then(
just(Token::Plus)
.or(just(Token::Minus))
.then(binary_1)
.repeated(),
)
.foldl(|lhs, (op, rhs)| match op {
Token::Plus => Expr::Add(Box::new(lhs), Box::new(rhs)),
Token::Minus => Expr::Sub(Box::new(lhs), Box::new(rhs)),
_ => unreachable!(),
});
binary_2
})
.then_ignore(end())
}
4. Evaluator
Evaluating the AST is straightforward. We just implement it using depth-first search (DFS) such that the mathematical operations are processed in the correct order.
impl Expr {
fn eval(&self) -> isize {
match self {
Expr::Int(n) => *n,
Expr::Neg(rhs) => -rhs.eval(),
Expr::Add(lhs, rhs) => lhs.eval() + rhs.eval(),
Expr::Sub(lhs, rhs) => lhs.eval() - rhs.eval(),
Expr::Mul(lhs, rhs) => lhs.eval() * rhs.eval(),
Expr::Div(lhs, rhs) => lhs.eval() / rhs.eval(),
}
}
}
Example
Evaluating `1 + 3 * 12` will proceed as follows.
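As a rough sketch of that evaluation (the AST below is the one the parser above produces, since multiplication binds tighter than addition):

```rust
// AST for "1 + 3 * 12":
//
//     Add(Int(1), Mul(Int(3), Int(12)))
//
// `eval()` reduces it depth-first:
//
//     Mul(Int(3), Int(12))  ->  3 * 12  =  36
//     Add(Int(1), 36)       ->  1 + 36  =  37
```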
5. main() Function
Finally, we put everything together in the `main()` function.
fn main() {
//reads the input expression from the command line
let input = env::args()
.nth(1)
.expect("Expected expression argument (e.g. `1 + 7 * (3 - 4) / 5`)");
//creates a lexer instance from the input
let lexer = Token::lexer(&input);
//splits the input into tokens, using the lexer
let mut tokens = vec![];
for (token, span) in lexer.spanned() {
match token {
Ok(token) => tokens.push(token),
Err(e) => {
println!("lexer error at {:?}: {}", span, e);
return;
}
}
}
//parses the tokens to construct an AST
let ast = match parser().parse(tokens) {
Ok(expr) => {
println!("[AST]\n{:#?}", expr);
expr
}
Err(e) => {
println!("parse error: {:#?}", e);
return;
}
};
//evaluates the AST to get the result
println!("\n[result]\n{}", ast.eval());
}
6. Extend the Calculator
Now that you've implemented a basic calculator, try extending its functionality with the following tasks:
- Handle zero-division gracefully: the current evaluator panics when a division by zero occurs. Change the return type of the evaluator from `isize` to `Result<isize, String>`, making it possible to return an error message (a sketch is given after this list).
- Add support for the modulo operator (`%`): update the lexer, parser, and evaluator to handle expressions like `10 % 3`.
- Add support for built-in functions: implement built-in functions such as `abs(x)`, `pow(x, y)`, or `rand()`.
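As a starting point for the first task, here is a minimal sketch (assuming the `Expr` enum defined earlier, and not part of the original example) that reports division by zero instead of panicking:

```rust
impl Expr {
    fn eval(&self) -> Result<isize, String> {
        match self {
            Expr::Int(n) => Ok(*n),
            Expr::Neg(rhs) => Ok(-rhs.eval()?),
            Expr::Add(lhs, rhs) => Ok(lhs.eval()? + rhs.eval()?),
            Expr::Sub(lhs, rhs) => Ok(lhs.eval()? - rhs.eval()?),
            Expr::Mul(lhs, rhs) => Ok(lhs.eval()? * rhs.eval()?),
            Expr::Div(lhs, rhs) => {
                let divisor = rhs.eval()?;
                if divisor == 0 {
                    // Propagate an error message instead of panicking.
                    Err("division by zero".to_owned())
                } else {
                    Ok(lhs.eval()? / divisor)
                }
            }
        }
    }
}
```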
JSON parser
JSON is a widely used format for exchanging data between applications, while remaining human-readable.
Possible values are defined recursively and can be any of the following:
/// Represent any valid JSON value.
#[derive(Debug)]
enum Value {
/// null.
Null,
/// true or false.
Bool(bool),
/// Any floating point number.
Number(f64),
/// Any quoted string.
String(String),
/// An array of values
Array(Vec<Value>),
/// A dictionary mapping keys to values.
Object(HashMap<String, Value>),
}
Objects are delimited with braces `{` and `}`, arrays with brackets `[` and `]`, and values are separated by commas `,`. Newlines, tabs, and spaces should be ignored by the lexer.
Knowing that, we can construct a lexer with Logos that will identify all those cases:
/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
#[token("false", |_| false)]
#[token("true", |_| true)]
Bool(bool),
#[token("{")]
BraceOpen,
#[token("}")]
BraceClose,
#[token("[")]
BracketOpen,
#[token("]")]
BracketClose,
#[token(":")]
Colon,
#[token(",")]
Comma,
#[token("null")]
Null,
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
Number(f64),
#[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
String(String),
}
The hardest part is to define valid regexes for the `Number` and `String` variants. The present solution was inspired by this StackOverflow thread.
Once we have our tokens, we must parse them into actual JSON values. We will proceed by creating three functions:
- `parse_value` for parsing any JSON value, without prior knowledge of its type;
- `parse_array` for parsing an array, assuming we matched `[`;
- and `parse_object` for parsing an object, assuming we matched `{`.
Starting with parsing an arbitrary value, we can easily obtain the four scalar types, `Bool`, `Null`, `Number`, and `String`, while we call the next two functions for parsing arrays and objects.
/// Parse a token stream into a JSON value.
fn parse_value(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
if let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
Ok(Token::BraceOpen) => parse_object(lexer),
Ok(Token::BracketOpen) => parse_array(lexer),
Ok(Token::Null) => Ok(Value::Null),
Ok(Token::Number(n)) => Ok(Value::Number(n)),
Ok(Token::String(s)) => Ok(Value::String(s)),
_ => Err((
"unexpected token here (context: value)".to_owned(),
lexer.span(),
)),
}
} else {
Err(("empty values are not allowed".to_owned(), lexer.span()))
}
}
To parse an array, we simply loop between tokens, alternating between parsing values and commas, until a closing bracket is found.
/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut array = Vec::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_value = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) if !awaits_comma => {
array.push(Value::Bool(b));
awaits_value = false;
}
Ok(Token::BraceOpen) if !awaits_comma => {
let object = parse_object(lexer)?;
array.push(object);
awaits_value = false;
}
Ok(Token::BracketOpen) if !awaits_comma => {
let sub_array = parse_array(lexer)?;
array.push(sub_array);
awaits_value = false;
}
Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
Ok(Token::Comma) if awaits_comma => awaits_value = true,
Ok(Token::Null) if !awaits_comma => {
array.push(Value::Null);
awaits_value = false
}
Ok(Token::Number(n)) if !awaits_comma => {
array.push(Value::Number(n));
awaits_value = false;
}
Ok(Token::String(s)) if !awaits_comma => {
array.push(Value::String(s));
awaits_value = false;
}
_ => {
return Err((
"unexpected token here (context: array)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_value;
}
Err(("unmatched opening bracket defined here".to_owned(), span))
}
A similar approach is used for objects, where the only difference is that we expect (key, value) pairs, separated by a colon.
/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut map = HashMap::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_key = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
Ok(Token::Comma) if awaits_comma => awaits_key = true,
Ok(Token::String(key)) if !awaits_comma => {
match lexer.next() {
Some(Ok(Token::Colon)) => (),
_ => {
return Err((
"unexpected token here, expecting ':'".to_owned(),
lexer.span(),
))
}
}
let value = parse_value(lexer)?;
map.insert(key, value);
awaits_key = false;
}
_ => {
return Err((
"unexpected token here (context: object)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_key;
}
Err(("unmatched opening brace defined here".to_owned(), span))
}
Finally, we provide the full code, which you should be able to run with1:
cargo run --example json examples/example.json
1 You first need to clone this repository.
use logos::{Lexer, Logos, Span};
use std::collections::HashMap;
use std::env;
use std::fs;
type Error = (String, Span);
type Result<T> = std::result::Result<T, Error>;
/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
#[token("false", |_| false)]
#[token("true", |_| true)]
Bool(bool),
#[token("{")]
BraceOpen,
#[token("}")]
BraceClose,
#[token("[")]
BracketOpen,
#[token("]")]
BracketClose,
#[token(":")]
Colon,
#[token(",")]
Comma,
#[token("null")]
Null,
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
Number(f64),
#[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
String(String),
}
/// Represent any valid JSON value.
#[derive(Debug)]
enum Value {
/// null.
Null,
/// true or false.
Bool(bool),
/// Any floating point number.
Number(f64),
/// Any quoted string.
String(String),
/// An array of values
Array(Vec<Value>),
/// A dictionary mapping keys to values.
Object(HashMap<String, Value>),
}
/// Parse a token stream into a JSON value.
fn parse_value(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
if let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
Ok(Token::BraceOpen) => parse_object(lexer),
Ok(Token::BracketOpen) => parse_array(lexer),
Ok(Token::Null) => Ok(Value::Null),
Ok(Token::Number(n)) => Ok(Value::Number(n)),
Ok(Token::String(s)) => Ok(Value::String(s)),
_ => Err((
"unexpected token here (context: value)".to_owned(),
lexer.span(),
)),
}
} else {
Err(("empty values are not allowed".to_owned(), lexer.span()))
}
}
/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut array = Vec::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_value = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) if !awaits_comma => {
array.push(Value::Bool(b));
awaits_value = false;
}
Ok(Token::BraceOpen) if !awaits_comma => {
let object = parse_object(lexer)?;
array.push(object);
awaits_value = false;
}
Ok(Token::BracketOpen) if !awaits_comma => {
let sub_array = parse_array(lexer)?;
array.push(sub_array);
awaits_value = false;
}
Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
Ok(Token::Comma) if awaits_comma => awaits_value = true,
Ok(Token::Null) if !awaits_comma => {
array.push(Value::Null);
awaits_value = false
}
Ok(Token::Number(n)) if !awaits_comma => {
array.push(Value::Number(n));
awaits_value = false;
}
Ok(Token::String(s)) if !awaits_comma => {
array.push(Value::String(s));
awaits_value = false;
}
_ => {
return Err((
"unexpected token here (context: array)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_value;
}
Err(("unmatched opening bracket defined here".to_owned(), span))
}
/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut map = HashMap::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_key = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
Ok(Token::Comma) if awaits_comma => awaits_key = true,
Ok(Token::String(key)) if !awaits_comma => {
match lexer.next() {
Some(Ok(Token::Colon)) => (),
_ => {
return Err((
"unexpected token here, expecting ':'".to_owned(),
lexer.span(),
))
}
}
let value = parse_value(lexer)?;
map.insert(key, value);
awaits_key = false;
}
_ => {
return Err((
"unexpected token here (context: object)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_key;
}
Err(("unmatched opening brace defined here".to_owned(), span))
}
fn main() {
let filename = env::args().nth(1).expect("Expected file argument");
let src = fs::read_to_string(&filename).expect("Failed to read file");
let mut lexer = Token::lexer(src.as_str());
match parse_value(&mut lexer) {
Ok(value) => println!("{:#?}", value),
Err((msg, span)) => {
use ariadne::{ColorGenerator, Label, Report, ReportKind, Source};
let mut colors = ColorGenerator::new();
let a = colors.next();
Report::build(ReportKind::Error, &filename, 12)
.with_message("Invalid JSON".to_string())
.with_label(
Label::new((&filename, span))
.with_message(msg)
.with_color(a),
)
.finish()
.eprint((&filename, Source::from(src)))
.unwrap();
}
}
}
JSON parser with borrowed values
The previous parser owned its data by allocating strings. This can require quite some memory space, and using borrowed string slices can help us save space, while possibly also improving performance.
If you are familiar with Rust's concept of lifetimes, using `&str` string slices instead of owned `String`s is straightforward:
@ 33c29
- enum Token {
+ enum Token<'source> {
@ 62,63c58,59
- #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
- String(String),
+ #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
+ String(&'source str),
@ 70c66
- enum Value {
+ enum Value<'source> {
@ 78c74
- String(String),
+ String(&'source str),
@ 80c76
- Array(Vec<Value>),
+ Array(Vec<Value<'source>>),
@ 82c78
- Object(HashMap<String, Value>),
+ Object(HashMap<&'source str, Value<'source>>),
@ 88c84
- fn parse_value<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_value<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
@ 113c109
- fn parse_array<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_array<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
@ 167c163
- fn parse_object<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_object<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
The above code shows the lines you need to change from the previous example to use borrowed data.
Finally, we provide the full code, which you should be able to run with1:
cargo run --example json-borrowed examples/example.json
1 You first need to clone this repository.
use logos::{Lexer, Logos, Span};
use std::collections::HashMap;
use std::env;
use std::fs;
type Error = (String, Span);
type Result<T> = std::result::Result<T, Error>;
/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token<'source> {
#[token("false", |_| false)]
#[token("true", |_| true)]
Bool(bool),
#[token("{")]
BraceOpen,
#[token("}")]
BraceClose,
#[token("[")]
BracketOpen,
#[token("]")]
BracketClose,
#[token(":")]
Colon,
#[token(",")]
Comma,
#[token("null")]
Null,
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
Number(f64),
#[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
String(&'source str),
}
/// Represent any valid JSON value.
#[derive(Debug)]
enum Value<'source> {
/// null.
Null,
/// true or false.
Bool(bool),
/// Any floating point number.
Number(f64),
/// Any quoted string.
String(&'source str),
/// An array of values
Array(Vec<Value<'source>>),
/// A dictionary mapping keys to values.
Object(HashMap<&'source str, Value<'source>>),
}
/// Parse a token stream into a JSON value.
fn parse_value<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
if let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
Ok(Token::BraceOpen) => parse_object(lexer),
Ok(Token::BracketOpen) => parse_array(lexer),
Ok(Token::Null) => Ok(Value::Null),
Ok(Token::Number(n)) => Ok(Value::Number(n)),
Ok(Token::String(s)) => Ok(Value::String(s)),
_ => Err((
"unexpected token here (context: value)".to_owned(),
lexer.span(),
)),
}
} else {
Err(("empty values are not allowed".to_owned(), lexer.span()))
}
}
/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
let mut array = Vec::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_value = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) if !awaits_comma => {
array.push(Value::Bool(b));
awaits_value = false;
}
Ok(Token::BraceOpen) if !awaits_comma => {
let object = parse_object(lexer)?;
array.push(object);
awaits_value = false;
}
Ok(Token::BracketOpen) if !awaits_comma => {
let sub_array = parse_array(lexer)?;
array.push(sub_array);
awaits_value = false;
}
Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
Ok(Token::Comma) if awaits_comma => awaits_value = true,
Ok(Token::Null) if !awaits_comma => {
array.push(Value::Null);
awaits_value = false
}
Ok(Token::Number(n)) if !awaits_comma => {
array.push(Value::Number(n));
awaits_value = false;
}
Ok(Token::String(s)) if !awaits_comma => {
array.push(Value::String(s));
awaits_value = false;
}
_ => {
return Err((
"unexpected token here (context: array)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_value;
}
Err(("unmatched opening bracket defined here".to_owned(), span))
}
/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
let mut map = HashMap::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_key = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
Ok(Token::Comma) if awaits_comma => awaits_key = true,
Ok(Token::String(key)) if !awaits_comma => {
match lexer.next() {
Some(Ok(Token::Colon)) => (),
_ => {
return Err((
"unexpected token here, expecting ':'".to_owned(),
lexer.span(),
))
}
}
let value = parse_value(lexer)?;
map.insert(key, value);
awaits_key = false;
}
_ => {
return Err((
"unexpected token here (context: object)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_key;
}
Err(("unmatched opening brace defined here".to_owned(), span))
}
fn main() {
let filename = env::args().nth(1).expect("Expected file argument");
let src = fs::read_to_string(&filename).expect("Failed to read file");
let mut lexer = Token::lexer(src.as_str());
match parse_value(&mut lexer) {
Ok(value) => println!("{:#?}", value),
Err((msg, span)) => {
use ariadne::{ColorGenerator, Label, Report, ReportKind, Source};
let mut colors = ColorGenerator::new();
let a = colors.next();
Report::build(ReportKind::Error, &filename, 12)
.with_message("Invalid JSON".to_string())
.with_label(
Label::new((&filename, span))
.with_message(msg)
.with_color(a),
)
.finish()
.eprint((&filename, Source::from(src)))
.unwrap();
}
}
}
Contributing
If you are considering contributing to Logos, then this place is for you!
First, we really appreciate people that can help this project grow, and we would like to guide you through the standard contribution process.
There are many ways to help us, and here is a short list of some of them:
- fixing a bug, by providing a patch (or suggesting in the comments how one could fix it);
- correcting some typos in the documentation, the book, or anywhere else;
- raising an issue about a problem (i.e., opening an issue on GitHub);
- proposing new features (either with an issue or a pull request on GitHub);
- or improving the documentation (either in the crate or in the book).
In any case, GitHub is the place-to-go for anything related to contributing.
Below, we provide a few help pages (or links) to content that can help you understand Logos' internals and how you can create and submit a contribution.
- If you are new to GitHub or git, please consider reading those two guides:
- GitHub’s Hello World;
- and GitHub Pull Request in 100 Seconds (video).
- To setup and test your code locally, see the Setup page.
- To know a bit more how Logos works, check the Internals.
Setup
On this page, you will find all the information needed to run and test your own version of the Logos crate, locally.
We assume you have basic knowledge of git and GitHub. If that is not the case, please refer to the links mentioned in Contributing.
Prerequisites
You need to have both git and Rust installed on your computer; see the installation procedures:
Once it's done, clone the Logos repository on your computer:
git clone https://github.com/maciejhirsz/logos.git
If you have a fork of this repository, make sure to clone it instead.
Finally, launch a terminal (i.e., command-line) session and go to the `logos` directory.
Checking the code compiles
A good way to see if your code compiles is to use the eponymous command:
cargo check --workspace
Formatting and linting your code
Prior to suggesting changes in a pull request, it is important to both format your code:
cargo fmt
and check against Rust's linter:
cargo clippy
Make sure to run those frequently, otherwise your pull request will probably fail to pass the automated tests.
Testing your code
Code that compiles isn't necessarily correct, and testing it against known cases is good practice:
cargo test --workspace
You can also run benchmarks:
cargo bench --workspace --benches
Building the documentation
Logos' documentation needs to be built with Rust's nightly toolchain.
You can install the latest nightly channel with:
rustup install nightly
Then, use the following command to build the documentation with a similar configuration to the one used by docs.rs:
RUSTDOCFLAGS="--cfg docsrs" cargo +nightly doc \
--features debug \
-Zunstable-options \
-Zrustdoc-scrape-examples \
--no-deps \
--open
Building the book
Logos' book can be built with mdBook.
This tool can be installed with cargo
:
cargo install mdbook
You also need to install `mdbook-admonish` and its assets:
cargo install mdbook-admonish
cd book/ # You must run the next command from the book/ directory
mdbook-admonish install
Then, you can build the book with:
mdbook serve book --open
Any change in the `./book` folder will automatically trigger a new build, and the pages will be live-reloaded.
Internals
Logos' core functionalities are split across several crates:
- `logos` is the main crate, which you add to your project (in `Cargo.toml`) to obtain the `Logos` derive macro. The public API is limited to this crate, and most users should only use this crate, not the others.
- `logos-derive` is a very simple but necessary crate that exposes `logos-codegen`'s code as a derive macro.
- `logos-codegen` contains the most technical parts of Logos: the code that reads your token definitions and generates optimized code to create blazingly fast lexers. You can read a blog post from the author of Logos to get a small insight into what the `logos-codegen` crate does. In the future, we hope to provide more documents about how this crate works, so people are more likely to understand it and improve it with pull requests (see the Contributing section).
- `logos-cli` is a separate crate that installs a binary of the same name, and allows you to expand the `Logos` derive macro into code. It can be installed with `cargo install logos-cli`, and usage help can be obtained through the `logos-cli --help` command. This tool can be useful if your token definitions stay constant and you want to reduce the compile-time overhead caused by derive macros.
- `logos-fuzz` is an internal crate that uses afl.rs to find confusing panics before they reach the developer. To use this tool, see the Fuzzing guide.
Fuzzing
Fuzzing is a technique to test a piece of software by injecting randomly generated inputs. This can be pretty useful to discover bugs, as pointed out in #407.
Logos' fuzzing crate is powered by afl.rs that finds panics in Logos' methods.
Usage
First, make sure you have `cargo-afl` installed; see the rust-fuzz afl setup guide for installation information.
Next, change your current working directory to the `fuzz` folder.
Building
Before fuzzing, you need to build the target with:
cargo afl build
Fuzzy testing
The recommended way to run the tests is with:
cargo afl fuzz -i in -o out ../target/debug/logos-fuzz
Note that it may run for a (very) long time before it encounters any bug.
Replaying a Crash
If you happen to find a bug that crashes the program, you can replay it with:
cargo afl run logos-fuzz < out/default/crashes/crash_file
Reporting a Bug
If you encounter a crash and you feel the error message is not appropriate, please report it by opening an issue. Don't forget to include your crash file so we can later reproduce it.