Logos Handbook



Hi there!

Logos is a fast and easy to use lexer generator written in Rust. While Rust has excellent documentation tools (and you can access the API docs for Logos at docs.rs), it's not the easiest thing to document custom syntax used by procedural macros, of which Logos has a bit. This Handbook seeks to remedy this!

In a nutshell

There are two main types in Logos:

  • The Logos trait, which comes with its own derive macro. The derive macro uses custom attributes (the things in brackets: #[...]) with plain string or regular expression syntax on enum variants, as patterns for some input.
  • The Lexer<T: Logos>, which is an iterator that takes some input (&str, sometimes &[u8]) and performs lexical analysis on the input on the go, producing variants of the enum T matching the defined patterns.

Getting Started

Logos can be included in your Rust project using the cargo add logos command, or by directly modifying your Cargo.toml file:

[dependencies]
logos = "0.15.0"

Then, you can automatically derive the Logos trait on your enum using the Logos derive macro:

#![allow(unused)]
fn main() {
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")] // Ignore this regex pattern between tokens
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,
}
}

Then, you can use the Logos::lexer method to turn any &str into an iterator of tokens1:

#![allow(unused)]
fn main() {
let mut lex = Token::lexer("Create ridiculously fast Lexers.");

assert_eq!(lex.next(), Some(Ok(Token::Text)));
assert_eq!(lex.span(), 0..6);
assert_eq!(lex.slice(), "Create");

assert_eq!(lex.next(), Some(Ok(Token::Text)));
assert_eq!(lex.span(), 7..19);
assert_eq!(lex.slice(), "ridiculously");

assert_eq!(lex.next(), Some(Ok(Token::Fast)));
assert_eq!(lex.span(), 20..24);
assert_eq!(lex.slice(), "fast");

assert_eq!(lex.next(), Some(Ok(Token::Text)));
assert_eq!(lex.slice(), "Lexers");
assert_eq!(lex.span(), 25..31);

assert_eq!(lex.next(), Some(Ok(Token::Period)));
assert_eq!(lex.span(), 31..32);
assert_eq!(lex.slice(), ".");

assert_eq!(lex.next(), None);
}
1 Each item is actually a Result<Token, _>, because the lexer returns an error if some part of the string slice does not match any variant of Token.

Because the Lexer returned by Logos::lexer implements the Iterator trait, you can use a for .. in construct:

#![allow(unused)]
fn main() {
for result in Token::lexer("Create ridiculously fast Lexers.") {
    match result {
        Ok(token) => println!("{:#?}", token),
        Err(e) => panic!("some error occurred: {}", e),
    }
}
}

Getting Help

If you need help using Logos, there are three places you can go to depending on what you are looking for:

  • this book for a documented walk through Logos' usage, with detailed examples, and more. A must read for any newcomer;
  • the API documentation to obtain precise information about function signatures and what the Logos crate exposes in terms of features;
  • and GitHub issues for anything else that is not covered by any of the two above.

Regarding GitHub issues, it's highly recommended to first check if another issue, either open or closed, already covers the topic you are looking for. If not, then consider creating a new issue with necessary information about your question, problem or else.

Attributes

The #[derive(Logos)] procedural macro recognizes three different attribute names.

  • #[logos] is the main attribute, which can be attached to the enum of your token definition. It allows you to define the Extras associated type in order to put custom state into the Lexer, or to declare concrete types for generic type parameters, if your enum uses any. It is strictly optional. It also lets you define parts of the input the lexer must skip, the error type, and regex subpatterns.
  • Most importantly, the #[token] and #[regex] attributes. These allow you to define patterns to match against the input: plain text strings with #[token], or regular expression syntax with #[regex]. Aside from that difference, they are equivalent, and any extra arguments you can pass to one, you can pass to the other.

#[logos]

As previously said, the #[logos] attribute can be attached to the enum of your token definition to customize your lexer. Note that all of these options are optional.

The syntax is as follows:

#![allow(unused)]
fn main() {
#[derive(Logos)]
#[logos(skip "regex literal")]
#[logos(extras = ExtrasType)]
#[logos(error = ErrorType)]
#[logos(crate = path::to::logos)]
#[logos(source = SourceType)]
enum Token {
    /* ... */
}
}

where "regex literal" can be any regex supported by #[regex], and ExtrasType can be of any type!

An example usage of skip is provided in the JSON parser example.

For more details about extras, read the eponymous section.
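
Regex subpatterns, mentioned above, let you name a regex fragment once and reuse it in several patterns. A minimal sketch (the subpattern name xdigit is our own choice):

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(subpattern xdigit = r"[0-9a-fA-F]")]
enum HexToken {
    // The named fragment is referenced with `(?&xdigit)`.
    #[regex(r"0[xX](?&xdigit)+")]
    HexNumber,
}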

Custom error type

By default, Logos uses () as the error type, which means that it doesn't store any information about the error. This can be changed by using the #[logos(error = ErrorType)] attribute on the enum. The type ErrorType can be any type that implements Clone, PartialEq, Default, and From<E> for each callback's error type E.

ErrorType must implement the Default trait because invalid tokens, i.e., literals that do not match any variant, will produce Err(ErrorType::default()).

Here is an example using a custom error type:

use logos::Logos;

use std::num::ParseIntError;

#[derive(Default, Debug, Clone, PartialEq)]
enum LexingError {
    InvalidInteger(String),
    #[default]
    NonAsciiCharacter,
}

/// Error type returned by calling `lex.slice().parse()` to u8.
impl From<ParseIntError> for LexingError {
    fn from(err: ParseIntError) -> Self {
        use std::num::IntErrorKind::*;
        match err.kind() {
            PosOverflow | NegOverflow => LexingError::InvalidInteger("overflow error".to_owned()),
            _ => LexingError::InvalidInteger("other error".to_owned()),
        }
    }
}

#[derive(Debug, Logos, PartialEq)]
#[logos(error = LexingError)]
#[logos(skip r"[ \t]+")]
enum Token {
    #[regex(r"[a-zA-Z]+")]
    Word,
    #[regex(r"[0-9]+", |lex| lex.slice().parse())]
    Integer(u8),
}

fn main() {
    // 256 overflows u8, since u8's max value is 255.
    // 'é' is not a valid ascii letter.
    let mut lex = Token::lexer("Hello 256 Jérome");

    assert_eq!(lex.next(), Some(Ok(Token::Word)));
    assert_eq!(lex.slice(), "Hello");

    assert_eq!(
        lex.next(),
        Some(Err(LexingError::InvalidInteger(
            "overflow error".to_owned()
        )))
    );
    assert_eq!(lex.slice(), "256");

    assert_eq!(lex.next(), Some(Ok(Token::Word)));
    assert_eq!(lex.slice(), "J");

    assert_eq!(lex.next(), Some(Err(LexingError::NonAsciiCharacter)));
    assert_eq!(lex.slice(), "é");

    assert_eq!(lex.next(), Some(Ok(Token::Word)));
    assert_eq!(lex.slice(), "rome");

    assert_eq!(lex.next(), None);
}

You can add error variants to LexingError, and implement From<E> for each error type E that could be returned by a callback. See callbacks.

Specifying path to logos

You can force the derive macro to use a different path to the Logos crate with #[logos(crate = path::to::logos)].
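
This is mainly useful when logos is re-exported by another crate rather than being a direct dependency. A minimal sketch, where my_facade is a hypothetical crate that re-exports logos:

// Assumption: `my_facade` re-exports logos as `my_facade::logos`.
use my_facade::logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(crate = my_facade::logos)]
enum Token {
    #[token("fast")]
    Fast,
}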

Custom source type

By default, Logos's lexer will accept &str as input, unless any of the pattern literals match a non-UTF-8 byte sequence, in which case it will fall back to &[u8]. You can override this behavior by forcing one of the two source types. You can also specify any custom type that implements Source.
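
As a small sketch of the automatic fallback described above, a token defined with a non-UTF-8 byte-string literal makes the derived lexer take &[u8] input:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // This byte sequence is not valid UTF-8, so the source type becomes `[u8]`.
    #[token(b"\xCA\xFE")]
    Magic,

    #[regex(b"[a-z]+")]
    Lowercase,
}

fn main() {
    // The lexer now takes a byte slice instead of a &str.
    let mut lex = Token::lexer(&[0xCA, 0xFE, b'h', b'i'][..]);

    assert_eq!(lex.next(), Some(Ok(Token::Magic)));
    assert_eq!(lex.next(), Some(Ok(Token::Lowercase)));
}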

#[token] and #[regex]

For each variant you declare in an enum that uses the Logos derive macro, you can specify one or more string literals or regexes it can match.

The usage syntax is as follows:

#![allow(unused)]
fn main() {
#[derive(Logos)]
enum Token {
    #[token(literal [, callback, priority = <integer>, ignore(<flag>, ...)])]
    #[regex(literal [, callback, priority = <integer>, ignore(<flag>, ...)])]
    SomeVariant,
}
}

where literal can be any &str or &[u8] string literal, callback can either be a closure or a literal path to a function (see the Using callbacks section), priority can be any positive integer (see the Token disambiguation section), and flag can be one of: case, ascii_case. Only literal is required; the others are optional.

You can stack any number of #[token] and/or #[regex] attributes on top of the same variant.
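
A minimal sketch combining stacked patterns, a callback, and an ignore flag:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n]+")]
enum Token {
    // `ignore(ascii_case)` makes the literal match "null", "NULL", "Null", ...
    #[token("null", ignore(ascii_case))]
    Null,

    // Several patterns can be stacked on one variant, each with its own callback.
    #[regex("[0-9]+", |lex| lex.slice().parse().ok())]
    #[token("zero", |_| 0u64)]
    Number(u64),
}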

Info

For a list of supported regex literals, read the Common regular expressions section.

Token disambiguation

When two or more tokens can match a given sequence, Logos computes the priority of each pattern (#[token] or #[regex]) and uses that priority to decide which pattern should match.

The rule of thumb is:

  • Longer beats shorter.
  • Specific beats generic.

If any two definitions could match the same input, like fast and [a-zA-Z]+ in the example above, it's the longer and more specific definition of Token::Fast that will be the result.

This is done by comparing the numeric priority attached to each definition. Every consecutive, non-repeating single byte adds 2 to the priority, while every range or regex class adds 1. Loops or optional blocks are ignored, while alternations count the shortest alternative:

  • [a-zA-Z]+ has a priority of 2 (lowest possible), because at minimum it can match a single byte to a class;
  • foobar has a priority of 12;
  • and (foo|hello)(bar)? has a priority of 6, foo being its shortest possible match.

Generally speaking, equivalent regex patterns have the same priority. E.g., a|b is equivalent to [a-b], and both have a priority of 2.

Info

When two different patterns have the same priority, Logos will issue a compilation error. To prevent this from happening, you can manually set the priority of a given pattern with, e.g., #[token("foobar", priority = 20)].
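
A minimal sketch of such a conflict and its resolution:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Both classes can match "c" and have equal computed priority, so the
    // derive macro would report a conflict; raising one priority resolves it.
    #[regex("[abc]+", priority = 3)]
    Abc,

    #[regex("[cde]+")]
    Cde,
}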

Using Extras

When deriving the Logos trait, you may want to convey some internal state between your tokens. That is where Logos::Extras comes to the rescue.

Each Lexer has a public field called extras that can be accessed and mutated to keep track of and modify some internal state. By default, this field is set to (), but its type can be modified using the derive attribute #[logos(extras = <some type>)] on your enum declaration.

For example, one may want to know the location, both line and column indices, of each token. This is especially useful when one needs to report an erroneous token to the user, in a user-friendly manner.

/// Simple tokens to retrieve words and their location.
#[derive(Debug, Logos)]
#[logos(extras = (usize, usize))]
enum Token {
    #[regex(r"\n", newline_callback)]
    Newline,

    #[regex(r"\w+", word_callback)]
    Word((usize, usize)),
}

The above token definition will hold two tokens: Newline and Word. The former is only used to keep track of the line numbering and will be skipped using Skip as a return value from its callback function. The latter will be a word with (line, column) indices.

To make it easy, the lexer will contain the following two extras:

  • extras.0: the line number;
  • extras.1: the char index of the current line.

We now have to define the two callback functions:

/// Update the line count and the char index.
fn newline_callback(lex: &mut Lexer<Token>) -> Skip {
    lex.extras.0 += 1;
    lex.extras.1 = lex.span().end;
    Skip
}

/// Compute the line and column position for the current word.
fn word_callback(lex: &mut Lexer<Token>) -> (usize, usize) {
    let line = lex.extras.0;
    let column = lex.span().start - lex.extras.1;

    (line, column)
}

Extras can of course be used for more complicated logic, and there is no limit to what you can store within the public extras field.

Finally, we provide the full code, which you should be able to run with1:

cargo run --example extras Cargo.toml

1 You first need to clone this repository.

use logos::{Lexer, Logos, Skip};
use std::env;
use std::fs;

/// Update the line count and the char index.
fn newline_callback(lex: &mut Lexer<Token>) -> Skip {
    lex.extras.0 += 1;
    lex.extras.1 = lex.span().end;
    Skip
}

/// Compute the line and column position for the current word.
fn word_callback(lex: &mut Lexer<Token>) -> (usize, usize) {
    let line = lex.extras.0;
    let column = lex.span().start - lex.extras.1;

    (line, column)
}

/// Simple tokens to retrieve words and their location.
#[derive(Debug, Logos)]
#[logos(extras = (usize, usize))]
enum Token {
    #[regex(r"\n", newline_callback)]
    Newline,

    #[regex(r"\w+", word_callback)]
    Word((usize, usize)),
}

fn main() {
    let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
        .expect("Failed to read file");

    let mut lex = Token::lexer(src.as_str());

    while let Some(token) = lex.next() {
        if let Ok(Token::Word((line, column))) = token {
            println!("Word '{}' found at ({}, {})", lex.slice(), line, column);
        }
    }
}

Using callbacks

Logos can also call arbitrary functions whenever a pattern is matched, which can be used to put data into a variant:

use logos::{Logos, Lexer};

// Note: callbacks can return `Option` or `Result`
fn kilo(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k'
    Some(n * 1_000)
}

fn mega(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm'
    Some(n * 1_000_000)
}

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    // Callbacks can use closure syntax, or refer
    // to a function defined elsewhere.
    //
    // Each pattern can have it's own callback.
    #[regex("[0-9]+", |lex| lex.slice().parse().ok())]
    #[regex("[0-9]+k", kilo)]
    #[regex("[0-9]+m", mega)]
    Number(u64),
}

fn main() {
    let mut lex = Token::lexer("5 42k 75m");

    assert_eq!(lex.next(), Some(Ok(Token::Number(5))));
    assert_eq!(lex.slice(), "5");

    assert_eq!(lex.next(), Some(Ok(Token::Number(42_000))));
    assert_eq!(lex.slice(), "42k");

    assert_eq!(lex.next(), Some(Ok(Token::Number(75_000_000))));
    assert_eq!(lex.slice(), "75m");

    assert_eq!(lex.next(), None);
}

Logos can handle callbacks with the following return types:

| Return type        | Produces                                                                             |
|--------------------|--------------------------------------------------------------------------------------|
| ()                 | Ok(Token::Unit)                                                                      |
| bool               | Ok(Token::Unit) or Err(<Token as Logos>::Error::default())                           |
| Result<(), E>      | Ok(Token::Unit) or Err(<Token as Logos>::Error::from(err))                           |
| T                  | Ok(Token::Value(T))                                                                  |
| Option<T>          | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::default())                       |
| Result<T, E>       | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::from(err))                       |
| Skip               | skips matched input                                                                  |
| Result<Skip, E>    | skips matched input or Err(<Token as Logos>::Error::from(err))                       |
| Filter<T>          | Ok(Token::Value(T)) or skips matched input                                           |
| FilterResult<T, E> | Ok(Token::Value(T)), Err(<Token as Logos>::Error::from(err)), or skips matched input |

Callbacks can also be used to perform more specialized lexing in places where regular expressions are too limiting. For specifics, look at Lexer::remainder and Lexer::bump.
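
As a rough sketch of that technique (not taken from the examples in this book), a callback can match an opening delimiter and then extend the token manually over the rest of the construct:

use logos::{Lexer, Logos};

/// After matching the opening "/*", scan the remaining input for the closing
/// "*/" and extend the current token over the whole comment with `bump`.
fn block_comment(lex: &mut Lexer<Token>) -> Result<(), ()> {
    match lex.remainder().find("*/") {
        Some(len) => {
            lex.bump(len + 2); // also consume the closing "*/"
            Ok(())
        }
        None => Err(()), // unterminated comment
    }
}

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    #[token("/*", block_comment)]
    BlockComment,

    #[regex("[a-zA-Z]+")]
    Word,
}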

Common regular expressions

Maybe the most important feature of Logos is its ability to accept regex patterns in your tokens' definition.

Regular expressions, or regexes for short, are sequences of characters (or bytes) that define a match pattern. When constructing lexers, this is especially useful to define tokens that should match a set of similar literals. E.g., a sequence of 3 ASCII uppercase letters and 3 digits could define a license plate, and could be matched with the following regex: "[A-Z]{3}[0-9]{3}".
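As a sketch, that pattern can be used directly in a token definition:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Three ASCII uppercase letters followed by three digits, e.g. "ABC123".
    #[regex("[A-Z]{3}[0-9]{3}")]
    LicensePlate,
}

fn main() {
    let mut lex = Token::lexer("ABC123");
    assert_eq!(lex.next(), Some(Ok(Token::LicensePlate)));
}
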

For more details about regexes in Rust, refer to the regex crate.

Valid regexes that are not supported

Because Logos aims at generating high-performance code, it never allows backtracking. This means that once a byte is read from the input source, it will never be read again. This implementation choice comes at a cost: not all valid regexes are supported by Logos1.

For reference, Logos parses regexes using regex-syntax = 0.6, and transforms its high-level intermediate representation (HIR) into some medium intermediate representation (MIR). From HIR, MIR does not support the following HirKinds:

  • Non-greedy repetitions, i.e., matching as little as possible of a given pattern.
  • ".*" and ".+" repetition patterns, because they could potentially consume the whole input source, breaking the non-backtracking rule. For solutions, see footnote1 or read the error message.
  • Word boundaries, i.e., r"\b".
  • Anchors, because the input source does not treat lines separately.

Additionally, note that capture groups will silently be ungrouped, because Logos does not support capturing groups; only the main slice (lex.slice()) is available.

1 Most of the time, however, it is possible to circumvent this issue by rewriting your regex another way, or by using callbacks. E.g., see #302.
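
As a small sketch of such a rewrite, a greedy "any character" repetition can often be replaced by a negated class that stops at the closing delimiter:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Instead of the unsupported `".*"`, the negated class bounds the
    // repetition so no backtracking is needed.
    #[regex(r#""[^"]*""#)]
    QuotedString,
}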

Other issues

Logos' support for regexes is not yet complete, and bugs may still exist. Some are found at compile time, while others will produce wrong matches or panic.

If you ever feel like your patterns do not match the expected source slices, please check the GitHub issues. If no issue covers your problem, we encourage you to create a new issue, and document it as best as you can so that the issue can be reproduced locally.

Debugging

Instructions on how to debug your Logos lexer.

Visualizing Logos Graph

Logos works by deriving a graph from the tokens that you define. This graph describes how the lexer moves through different states when processing input.

Hence, it may be beneficial during debugging to be able to visualize this graph, to understand how Logos will match the various tokens.

If we take this example:

use logos::Logos;

#[derive(Debug, Logos, PartialEq)]
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,
}
fn main() {
    let input = "Create ridiculously fast Lexers.";

    let mut lexer = Token::lexer(input);
    while let Some(token) = lexer.next() {
        println!("{:?}", token);
    }
}

Logos actually constructs a graph that contains the logic for matching tokens:

graph = {
    1: ::Fast,
    2: ::Period,
    3: ::Text,
    4: {
        [A-Z] ⇒ 4,
        [a-z] ⇒ 4,
        _ ⇒ 3,
    },
    7: [
        ast ⇒ 8,
        _ ⇒ 4*,
    ],
    8: {
        [A-Z] ⇒ 4,
        [a-z] ⇒ 4,
        _ ⇒ 1,
    },
    9: {
        . ⇒ 2,
        [A-Z] ⇒ 4,
        [a-e] ⇒ 4,
        f ⇒ 7,
        [g-z] ⇒ 4,
    },
}

This graph can help us understand how our patterns are matched, and maybe understand why we have a bug at some point.

Let's get started by trying to understand how Logos is matching the . character, which we've tokenized as Token::Period.

We can begin our search by looking at node 9 for the character .. We can see that if Logos matches a ., it will jump (⇒) to node 2. We can then follow that to 2, which resolves to our ::Period token.

Logos will then continue to look for any matches past our . character. This is required in case there is a potential continuation after the . character. In the input we provided, though, there are no additional characters, since it is the end of our input.

We can also try to identify how the token fast works by looking at 9 first, and seeing that f will cause Logos to jump to 7. This will then resolve the last letters of our word fast by matching ast, which jumps to 8. Since the input provided to the lexer does not include an alphabetic character after the word "fast", but rather a whitespace, the token ::Fast will be recognized. The graph will then look for further potential continuation (here, [g-z] ⇒ 4).

Enabling

To enable debugging output, you can enable the debug feature in your Cargo.toml file, like this:

# Cargo.toml
[dependencies]
logos = { version = "1.2.3", features = ["debug"] }

Next, you can build your project with cargo build and the output will contain a debug representation of your graph(s).

Unsafe Code

By default, Logos uses unsafe code to avoid unnecessary bounds checks while accessing slices of the input Source.

This unsafe code also exists in the code generated by the Logos derive macro, which generates a deterministic finite automaton (DFA). Reasoning about the correctness of this generated code can be difficult: if the derivation of the DFA in Logos is correct, then the generated code will be correct, and any mistakes in the implementation would be caught given sufficient fuzz testing.

Use of unsafe code is the default as this typically provides the fastest parser.

Disabling Unsafe Code

However, for applications accepting untrusted input in a trusted context, this may not be a sufficient correctness justification.

For those applications which cannot tolerate unsafe code, the forbid-unsafe feature may be enabled. This replaces unchecked accesses in the Logos crate with safe, checked alternatives which will panic on out-of-bounds access rather than cause undefined behavior. Additionally, code generated by the macro will not use the unsafe keyword, so the generated code can be used in crates that enable the #![forbid(unsafe_code)] attribute.
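
For example, the feature can be enabled in your Cargo.toml (using the version from the Getting Started section):

[dependencies]
logos = { version = "0.15.0", features = ["forbid-unsafe"] }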

When the forbid-unsafe feature is added to a direct dependency on the Logos crate, Feature Unification ensures that any transitive inclusion of Logos via other dependencies also has unsafe code disabled.

Generally, disabling unsafe code will result in a slower parser.

However, making definitive statements about the performance of safe-only code is difficult, as there are too many variables to consider between compiler optimizations, the specific grammar being parsed, and the target processor. The automated benchmarks of this crate show around a 10% slowdown for safe-only code at the time of this writing.

Examples

The following examples are ordered by increasing level of complexity.

Brainfuck interpreter: Lexers are very powerful tools for parsing programs into meaningful instructions. We show you how to build an interpreter for the Brainfuck programming language in under 100 lines of code!

Simple calculator: For a relatively large domain-specific language (DSL), or any programming language, implementing an interpreter typically involves converting the tokens generated by a lexer into an abstract syntax tree (AST) via a parser, and then evaluating it. We show you how to build a simple calculator that evaluates arithmetic expressions by combining Logos with a parser generator library.

JSON parser: We present a JSON parser written with Logos that does nice error reporting when invalid values are encountered.

JSON-borrowed parser: A variant of the previous parser, but that does not own its data.

Brainfuck interpreter

In most programming languages, commands can be made of multiple program tokens, where a token is simply a string slice that has a particular meaning for the language. For example, in Rust, the function signature pub fn main() could be split by the lexer into the tokens pub, fn, main, (, and ). Then, the parser combines tokens into meaningful program instructions.

However, there exist programming languages, such as Brainfuck, that are so simple that each token can be mapped to a single instruction. There are just 8 single-character tokens:

/// Each [`Op`] variant is a single character.
#[derive(Debug, Logos)]
enum Op {
    /// Increment pointer.
    #[token(">")]
    IncPointer,
    /// Decrement pointer.
    #[token("<")]
    DecPointer,
    /// Increment data at pointer.
    #[token("+")]
    IncData,
    /// Decrement data at pointer.
    #[token("-")]
    DecData,
    /// Output data at pointer.
    #[token(".")]
    OutData,
    /// Input (read) to data at pointer.
    #[token(",")]
    InpData,
    /// Conditionally jump to matching `']'`.
    #[token("[")]
    CondJumpForward,
    /// Conditionally jump to matching `'['`.
    #[token("]")]
    CondJumpBackward,
}

All other characters must be ignored.

Once the tokens are obtained, a Brainfuck interpreter can easily be created using a finite-state machine. For the sake of simplicity, we collected all the tokens into one vector called operations.
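
The collection itself is a one-liner, taken from the full code below; lexing errors, i.e. any non-command character, are simply filtered out:

let operations: Vec<_> = Op::lexer(code).filter_map(|op| op.ok()).collect();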

Now, creating an interpreter becomes straightforward1:

    let mut i: usize = 0;
    // True program execution.
    loop {
        match operations[i] {
            Op::IncPointer => pointer += 1,
            Op::DecPointer => pointer -= 1,
            Op::IncData => data[pointer] = data[pointer].wrapping_add(1),
            Op::DecData => data[pointer] = data[pointer].wrapping_sub(1),
            Op::OutData => print_byte(data[pointer]),
            Op::InpData => data[pointer] = read_byte(),
            Op::CondJumpForward => {
                if data[pointer] == 0 {
                    // Skip until matching end.
                    i = *pairs.get(&i).unwrap();
                }
            }
            Op::CondJumpBackward => {
                if data[pointer] != 0 {
                    // Go back to matching start.
                    i = *pairs_reverse.get(&i).unwrap();
                }
            }
        }
        i += 1;

        if i >= len {
            break;
        }
    }
1 There is a small trick to make this easy. As can be seen in the full code, we first check that every beginning loop ('[') has a matching end (']'). This way, we can create two maps, pairs and pairs_reverse, to easily jump back and forth between them.

Finally, we provide the full code, which you should be able to run with2:

cargo run --example brainfuck examples/hello_word.bf
2 You first need to clone this repository.

use logos::Logos;
use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::{self, Read};

/// Each [`Op`] variant is a single character.
#[derive(Debug, Logos)]
enum Op {
    /// Increment pointer.
    #[token(">")]
    IncPointer,
    /// Decrement pointer.
    #[token("<")]
    DecPointer,
    /// Increment data at pointer.
    #[token("+")]
    IncData,
    /// Decrement data at pointer.
    #[token("-")]
    DecData,
    /// Output data at pointer.
    #[token(".")]
    OutData,
    /// Input (read) to data at pointer.
    #[token(",")]
    InpData,
    /// Conditionally jump to matching `']'`.
    #[token("[")]
    CondJumpForward,
    /// Conditionally jump to matching `'['`.
    #[token("]")]
    CondJumpBackward,
}

/// Print one byte to the terminal.
#[inline(always)]
fn print_byte(byte: u8) {
    print!("{}", byte as char);
}

/// Read one byte from the terminal.
#[inline(always)]
fn read_byte() -> u8 {
    let mut input = [0u8; 1];
    io::stdin()
        .read_exact(&mut input)
        .expect("An error occurred while reading byte!");
    input[0]
}

/// Execute Brainfuck code from a string slice.
pub fn execute(code: &str) {
    let operations: Vec<_> = Op::lexer(code).filter_map(|op| op.ok()).collect();
    let mut data = [0u8; 30_000]; // Minimum recommended size
    let mut pointer: usize = 0;
    let len = operations.len();

    // We pre-process matching jump commands, and we create
    // a mapping between them.
    let mut queue = Vec::new();
    let mut pairs = HashMap::new();
    let mut pairs_reverse = HashMap::new();

    for (i, op) in operations.iter().enumerate() {
        match op {
            Op::CondJumpForward => queue.push(i),
            Op::CondJumpBackward => {
                if let Some(start) = queue.pop() {
                    pairs.insert(start, i);
                    pairs_reverse.insert(i, start);
                } else {
                    panic!(
                        "Unexpected conditional backward jump at position {}, does not match any '['",
                        i
                    );
                }
            }
            _ => (),
        }
    }

    if !queue.is_empty() {
        panic!("Unmatched conditional forward jump at positions {:?}, expecting a closing ']' for each of them", queue);
    }

    let mut i: usize = 0;
    // True program execution.
    loop {
        match operations[i] {
            Op::IncPointer => pointer += 1,
            Op::DecPointer => pointer -= 1,
            Op::IncData => data[pointer] = data[pointer].wrapping_add(1),
            Op::DecData => data[pointer] = data[pointer].wrapping_sub(1),
            Op::OutData => print_byte(data[pointer]),
            Op::InpData => data[pointer] = read_byte(),
            Op::CondJumpForward => {
                if data[pointer] == 0 {
                    // Skip until matching end.
                    i = *pairs.get(&i).unwrap();
                }
            }
            Op::CondJumpBackward => {
                if data[pointer] != 0 {
                    // Go back to matching start.
                    i = *pairs_reverse.get(&i).unwrap();
                }
            }
        }
        i += 1;

        if i >= len {
            break;
        }
    }
}

fn main() {
    let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
        .expect("Failed to read file");

    execute(src.as_str());
}

Simple calculator

This page (including the images) was contributed by ynn.

When you implement an interpreter for a domain-specific language (DSL), or any programming language, the process typically involves the following steps:

  1. Lexing: Splitting the input stream (i.e., source code string) into tokens via a lexer.

  2. Parsing: Converting the tokens into an abstract syntax tree (AST) via a parser.

  3. Evaluation: Evaluating the AST to produce the result.

In this example, we implement a simple calculator that evaluates arithmetic expressions such as 1 + 2 * 3 or ((1 + 2) * 3 + 4) * 2 + 4 / 3.

We use logos as the lexer generator and chumsky as the parser generator.

(figure: flow chart of the lexing, parsing, and evaluation steps)

1. Try It

Before diving into the implementation details, let's play with it1.

$ cargo run --example calculator '1 + 7 * (3 - 4) / 2'
1 You first need to clone this repository.

Output:

[AST]
Add(
    Int(
        1,
    ),
    Div(
        Mul(
            Int(
                7,
            ),
            Sub(
                Int(
                    3,
                ),
                Int(
                    4,
                ),
            ),
        ),
        Int(
            2,
        ),
    ),
)

[result]
-2

Full Code

use std::env;

use chumsky::prelude::*;
use logos::Logos;

#[derive(Logos, Debug, PartialEq, Eq, Hash, Clone)]
#[logos(skip r"[ \t\n]+")]
#[logos(error = String)]
enum Token {
    #[token("+")]
    Plus,

    #[token("-")]
    Minus,

    #[token("*")]
    Multiply,

    #[token("/")]
    Divide,

    #[token("(")]
    LParen,

    #[token(")")]
    RParen,

    #[regex("[0-9]+", |lex| lex.slice().parse::<isize>().unwrap())]
    Integer(isize),
}

#[derive(Debug)]
enum Expr {
    // Integer literal.
    Int(isize),

    // Unary minus.
    Neg(Box<Expr>),

    // Binary operators.
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
}

impl Expr {
    fn eval(&self) -> isize {
        match self {
            Expr::Int(n) => *n,
            Expr::Neg(rhs) => -rhs.eval(),
            Expr::Add(lhs, rhs) => lhs.eval() + rhs.eval(),
            Expr::Sub(lhs, rhs) => lhs.eval() - rhs.eval(),
            Expr::Mul(lhs, rhs) => lhs.eval() * rhs.eval(),
            Expr::Div(lhs, rhs) => lhs.eval() / rhs.eval(),
        }
    }
}

#[allow(clippy::let_and_return)]
fn parser() -> impl Parser<Token, Expr, Error = Simple<Token>> {
    recursive(|p| {
        let atom = {
            let parenthesized = p
                .clone()
                .delimited_by(just(Token::LParen), just(Token::RParen));

            let integer = select! {
                Token::Integer(n) => Expr::Int(n),
            };

            parenthesized.or(integer)
        };

        let unary = just(Token::Minus)
            .repeated()
            .then(atom)
            .foldr(|_op, rhs| Expr::Neg(Box::new(rhs)));

        let binary_1 = unary
            .clone()
            .then(
                just(Token::Multiply)
                    .or(just(Token::Divide))
                    .then(unary)
                    .repeated(),
            )
            .foldl(|lhs, (op, rhs)| match op {
                Token::Multiply => Expr::Mul(Box::new(lhs), Box::new(rhs)),
                Token::Divide => Expr::Div(Box::new(lhs), Box::new(rhs)),
                _ => unreachable!(),
            });

        let binary_2 = binary_1
            .clone()
            .then(
                just(Token::Plus)
                    .or(just(Token::Minus))
                    .then(binary_1)
                    .repeated(),
            )
            .foldl(|lhs, (op, rhs)| match op {
                Token::Plus => Expr::Add(Box::new(lhs), Box::new(rhs)),
                Token::Minus => Expr::Sub(Box::new(lhs), Box::new(rhs)),
                _ => unreachable!(),
            });

        binary_2
    })
    .then_ignore(end())
}

fn main() {
    //reads the input expression from the command line
    let input = env::args()
        .nth(1)
        .expect("Expected expression argument (e.g. `1 + 7 * (3 - 4) / 5`)");

    //creates a lexer instance from the input
    let lexer = Token::lexer(&input);

    //splits the input into tokens, using the lexer
    let mut tokens = vec![];
    for (token, span) in lexer.spanned() {
        match token {
            Ok(token) => tokens.push(token),
            Err(e) => {
                println!("lexer error at {:?}: {}", span, e);
                return;
            }
        }
    }

    //parses the tokens to construct an AST
    let ast = match parser().parse(tokens) {
        Ok(expr) => {
            println!("[AST]\n{:#?}", expr);
            expr
        }
        Err(e) => {
            println!("parse error: {:#?}", e);
            return;
        }
    };

    //evaluates the AST to get the result
    println!("\n[result]\n{}", ast.eval());
}

2. Lexer

Our calculator supports the following tokens:

  • Integer literals: 0, 1, 15, etc;

  • Unary operator: -;

  • Binary operators: +, -, *, /;

  • Parenthesized expressions: (3 + 5) * 2, ((1 + 2) * 3 + 4) * 2 + 3 / 2, etc.

#[derive(Logos, Debug, PartialEq, Eq, Hash, Clone)]
#[logos(skip r"[ \t\n]+")]
#[logos(error = String)]
enum Token {
    #[token("+")]
    Plus,

    #[token("-")]
    Minus,

    #[token("*")]
    Multiply,

    #[token("/")]
    Divide,

    #[token("(")]
    LParen,

    #[token(")")]
    RParen,

    #[regex("[0-9]+", |lex| lex.slice().parse::<isize>().unwrap())]
    Integer(isize),
}

3. Parser

While it is easy enough to manually implement a parser in this case (e.g., Pratt parsing), let's just use the chumsky crate, which is one of the most popular parser combinator libraries in Rust.

3.1 AST Definition

First, we define the AST.

#[derive(Debug)]
enum Expr {
    // Integer literal.
    Int(isize),

    // Unary minus.
    Neg(Box<Expr>),

    // Binary operators.
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
}

Note that

  • We name the enum not AST but Expr because an AST is just nested expressions.

  • There is no Parenthesized variant because parentheses only affect the order of operations (i.e., precedence), which is reflected in the AST structure.

  • Box is used because Rust does not allow recursive enums without indirection.

3.2 Parser Implementation

Next, we define the parser. The code may look a bit complicated if you are not familiar with parser combinator libraries, but it is actually quite simple. See Chumsky's official tutorial for the details.

fn parser() -> impl Parser<Token, Expr, Error = Simple<Token>> {
    recursive(|p| {
        let atom = {
            let parenthesized = p
                .clone()
                .delimited_by(just(Token::LParen), just(Token::RParen));

            let integer = select! {
                Token::Integer(n) => Expr::Int(n),
            };

            parenthesized.or(integer)
        };

        let unary = just(Token::Minus)
            .repeated()
            .then(atom)
            .foldr(|_op, rhs| Expr::Neg(Box::new(rhs)));

        let binary_1 = unary
            .clone()
            .then(
                just(Token::Multiply)
                    .or(just(Token::Divide))
                    .then(unary)
                    .repeated(),
            )
            .foldl(|lhs, (op, rhs)| match op {
                Token::Multiply => Expr::Mul(Box::new(lhs), Box::new(rhs)),
                Token::Divide => Expr::Div(Box::new(lhs), Box::new(rhs)),
                _ => unreachable!(),
            });

        let binary_2 = binary_1
            .clone()
            .then(
                just(Token::Plus)
                    .or(just(Token::Minus))
                    .then(binary_1)
                    .repeated(),
            )
            .foldl(|lhs, (op, rhs)| match op {
                Token::Plus => Expr::Add(Box::new(lhs), Box::new(rhs)),
                Token::Minus => Expr::Sub(Box::new(lhs), Box::new(rhs)),
                _ => unreachable!(),
            });

        binary_2
    })
    .then_ignore(end())
}

4. Evaluator

Evaluating the AST is straightforward. We just implement it using depth-first search (DFS) such that the mathematical operations are processed in the correct order.

impl Expr {
    fn eval(&self) -> isize {
        match self {
            Expr::Int(n) => *n,
            Expr::Neg(rhs) => -rhs.eval(),
            Expr::Add(lhs, rhs) => lhs.eval() + rhs.eval(),
            Expr::Sub(lhs, rhs) => lhs.eval() - rhs.eval(),
            Expr::Mul(lhs, rhs) => lhs.eval() * rhs.eval(),
            Expr::Div(lhs, rhs) => lhs.eval() / rhs.eval(),
        }
    }
}

Example

Evaluating 1 + 3 * 12 will proceed as below.

(figure: how the evaluator works)

5. main() Function

Finally, we put everything together in the main() function.

fn main() {
    //reads the input expression from the command line
    let input = env::args()
        .nth(1)
        .expect("Expected expression argument (e.g. `1 + 7 * (3 - 4) / 5`)");

    //creates a lexer instance from the input
    let lexer = Token::lexer(&input);

    //splits the input into tokens, using the lexer
    let mut tokens = vec![];
    for (token, span) in lexer.spanned() {
        match token {
            Ok(token) => tokens.push(token),
            Err(e) => {
                println!("lexer error at {:?}: {}", span, e);
                return;
            }
        }
    }

    //parses the tokens to construct an AST
    let ast = match parser().parse(tokens) {
        Ok(expr) => {
            println!("[AST]\n{:#?}", expr);
            expr
        }
        Err(e) => {
            println!("parse error: {:#?}", e);
            return;
        }
    };

    //evaluates the AST to get the result
    println!("\n[result]\n{}", ast.eval());
}

6. Extend the Calculator

Now that you've implemented a basic calculator, try extending its functionality with the following tasks:

  • Handle zero-division gracefully: The current evaluator panics when zero-division occurs. Change the return type of the evaluator from isize to Result<isize, String>, making it possible to return an error message.

  • Add support for the modulo operator (%): Update the lexer, parser, and evaluator to handle expressions like 10 % 3.

  • Add support for built-in functions: Implement built-in functions such as abs(x), pow(x, y) or rand().

JSON parser

JSON is a widely used, human-readable format for exchanging data between applications.

Possible values are defined recursively and can be any of the following:

/// Represent any valid JSON value.
#[derive(Debug)]
enum Value {
    /// null.
    Null,
    /// true or false.
    Bool(bool),
    /// Any floating point number.
    Number(f64),
    /// Any quoted string.
    String(String),
    /// An array of values
    Array(Vec<Value>),
    /// An dictionary mapping keys and values.
    Object(HashMap<String, Value>),
}

Objects are delimited with braces { and }, arrays with brackets [ and ], and values are separated with commas ,. Newlines, tabs, and spaces should be ignored by the lexer.

Knowing that, we can construct a lexer with Logos that will identify all those cases:

/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
    #[token("false", |_| false)]
    #[token("true", |_| true)]
    Bool(bool),

    #[token("{")]
    BraceOpen,

    #[token("}")]
    BraceClose,

    #[token("[")]
    BracketOpen,

    #[token("]")]
    BracketClose,

    #[token(":")]
    Colon,

    #[token(",")]
    Comma,

    #[token("null")]
    Null,

    #[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
    Number(f64),

    #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
    String(String),
}

Note

The hardest part is to define valid regexes for Number and String variants. The present solution was inspired by this stackoverflow thread.

Once we have our tokens, we must parse them into actual JSON values. We will proceed by creating three functions:

  • parse_value for parsing any JSON object, without prior knowledge of its type;
  • parse_array for parsing an array, assuming we matched [;
  • and parse_object for parsing an object, assuming we matched {.

Starting with parsing an arbitrary value, we can easily obtain the four scalar types (Bool, Null, Number, and String), while we call the next two functions to parse arrays and objects.

/// Parse a token stream into a JSON value.
fn parse_value(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
    if let Some(token) = lexer.next() {
        match token {
            Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
            Ok(Token::BraceOpen) => parse_object(lexer),
            Ok(Token::BracketOpen) => parse_array(lexer),
            Ok(Token::Null) => Ok(Value::Null),
            Ok(Token::Number(n)) => Ok(Value::Number(n)),
            Ok(Token::String(s)) => Ok(Value::String(s)),
            _ => Err((
                "unexpected token here (context: value)".to_owned(),
                lexer.span(),
            )),
        }
    } else {
        Err(("empty values are not allowed".to_owned(), lexer.span()))
    }
}

To parse an array, we simply loop over the tokens, alternating between parsing values and commas, until a closing bracket is found.

/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
    let mut array = Vec::new();
    let span = lexer.span();
    let mut awaits_comma = false;
    let mut awaits_value = false;

    while let Some(token) = lexer.next() {
        match token {
            Ok(Token::Bool(b)) if !awaits_comma => {
                array.push(Value::Bool(b));
                awaits_value = false;
            }
            Ok(Token::BraceOpen) if !awaits_comma => {
                let object = parse_object(lexer)?;
                array.push(object);
                awaits_value = false;
            }
            Ok(Token::BracketOpen) if !awaits_comma => {
                let sub_array = parse_array(lexer)?;
                array.push(sub_array);
                awaits_value = false;
            }
            Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
            Ok(Token::Comma) if awaits_comma => awaits_value = true,
            Ok(Token::Null) if !awaits_comma => {
                array.push(Value::Null);
                awaits_value = false
            }
            Ok(Token::Number(n)) if !awaits_comma => {
                array.push(Value::Number(n));
                awaits_value = false;
            }
            Ok(Token::String(s)) if !awaits_comma => {
                array.push(Value::String(s));
                awaits_value = false;
            }
            _ => {
                return Err((
                    "unexpected token here (context: array)".to_owned(),
                    lexer.span(),
                ))
            }
        }
        awaits_comma = !awaits_value;
    }
    Err(("unmatched opening bracket defined here".to_owned(), span))
}

A similar approach is used for objects, where the only difference is that we expect (key, value) pairs, separated by a colon.

/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
    let mut map = HashMap::new();
    let span = lexer.span();
    let mut awaits_comma = false;
    let mut awaits_key = false;

    while let Some(token) = lexer.next() {
        match token {
            Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
            Ok(Token::Comma) if awaits_comma => awaits_key = true,
            Ok(Token::String(key)) if !awaits_comma => {
                match lexer.next() {
                    Some(Ok(Token::Colon)) => (),
                    _ => {
                        return Err((
                            "unexpected token here, expecting ':'".to_owned(),
                            lexer.span(),
                        ))
                    }
                }
                let value = parse_value(lexer)?;
                map.insert(key, value);
                awaits_key = false;
            }
            _ => {
                return Err((
                    "unexpected token here (context: object)".to_owned(),
                    lexer.span(),
                ))
            }
        }
        awaits_comma = !awaits_key;
    }
    Err(("unmatched opening brace defined here".to_owned(), span))
}

Finally, we provide the full code, which you should be able to run with1:

cargo run --example json examples/example.json

1 You first need to clone this repository.

use logos::{Lexer, Logos, Span};

use std::collections::HashMap;
use std::env;
use std::fs;

type Error = (String, Span);

type Result<T> = std::result::Result<T, Error>;

/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
    #[token("false", |_| false)]
    #[token("true", |_| true)]
    Bool(bool),

    #[token("{")]
    BraceOpen,

    #[token("}")]
    BraceClose,

    #[token("[")]
    BracketOpen,

    #[token("]")]
    BracketClose,

    #[token(":")]
    Colon,

    #[token(",")]
    Comma,

    #[token("null")]
    Null,

    #[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
    Number(f64),

    #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
    String(String),
}

/// Represent any valid JSON value.
#[derive(Debug)]
enum Value {
    /// null.
    Null,
    /// true or false.
    Bool(bool),
    /// Any floating point number.
    Number(f64),
    /// Any quoted string.
    String(String),
    /// An array of values
    Array(Vec<Value>),
    /// An dictionary mapping keys and values.
    Object(HashMap<String, Value>),
}

/// Parse a token stream into a JSON value.
fn parse_value(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
    if let Some(token) = lexer.next() {
        match token {
            Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
            Ok(Token::BraceOpen) => parse_object(lexer),
            Ok(Token::BracketOpen) => parse_array(lexer),
            Ok(Token::Null) => Ok(Value::Null),
            Ok(Token::Number(n)) => Ok(Value::Number(n)),
            Ok(Token::String(s)) => Ok(Value::String(s)),
            _ => Err((
                "unexpected token here (context: value)".to_owned(),
                lexer.span(),
            )),
        }
    } else {
        Err(("empty values are not allowed".to_owned(), lexer.span()))
    }
}

/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
    let mut array = Vec::new();
    let span = lexer.span();
    let mut awaits_comma = false;
    let mut awaits_value = false;

    while let Some(token) = lexer.next() {
        match token {
            Ok(Token::Bool(b)) if !awaits_comma => {
                array.push(Value::Bool(b));
                awaits_value = false;
            }
            Ok(Token::BraceOpen) if !awaits_comma => {
                let object = parse_object(lexer)?;
                array.push(object);
                awaits_value = false;
            }
            Ok(Token::BracketOpen) if !awaits_comma => {
                let sub_array = parse_array(lexer)?;
                array.push(sub_array);
                awaits_value = false;
            }
            Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
            Ok(Token::Comma) if awaits_comma => awaits_value = true,
            Ok(Token::Null) if !awaits_comma => {
                array.push(Value::Null);
                awaits_value = false
            }
            Ok(Token::Number(n)) if !awaits_comma => {
                array.push(Value::Number(n));
                awaits_value = false;
            }
            Ok(Token::String(s)) if !awaits_comma => {
                array.push(Value::String(s));
                awaits_value = false;
            }
            _ => {
                return Err((
                    "unexpected token here (context: array)".to_owned(),
                    lexer.span(),
                ))
            }
        }
        awaits_comma = !awaits_value;
    }
    Err(("unmatched opening bracket defined here".to_owned(), span))
}

/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
    let mut map = HashMap::new();
    let span = lexer.span();
    let mut awaits_comma = false;
    let mut awaits_key = false;

    while let Some(token) = lexer.next() {
        match token {
            Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
            Ok(Token::Comma) if awaits_comma => awaits_key = true,
            Ok(Token::String(key)) if !awaits_comma => {
                match lexer.next() {
                    Some(Ok(Token::Colon)) => (),
                    _ => {
                        return Err((
                            "unexpected token here, expecting ':'".to_owned(),
                            lexer.span(),
                        ))
                    }
                }
                let value = parse_value(lexer)?;
                map.insert(key, value);
                awaits_key = false;
            }
            _ => {
                return Err((
                    "unexpected token here (context: object)".to_owned(),
                    lexer.span(),
                ))
            }
        }
        awaits_comma = !awaits_key;
    }
    Err(("unmatched opening brace defined here".to_owned(), span))
}

fn main() {
    let filename = env::args().nth(1).expect("Expected file argument");
    let src = fs::read_to_string(&filename).expect("Failed to read file");

    let mut lexer = Token::lexer(src.as_str());

    match parse_value(&mut lexer) {
        Ok(value) => println!("{:#?}", value),
        Err((msg, span)) => {
            use ariadne::{ColorGenerator, Label, Report, ReportKind, Source};

            let mut colors = ColorGenerator::new();

            let a = colors.next();

            Report::build(ReportKind::Error, &filename, 12)
                .with_message("Invalid JSON".to_string())
                .with_label(
                    Label::new((&filename, span))
                        .with_message(msg)
                        .with_color(a),
                )
                .finish()
                .eprint((&filename, Source::from(src)))
                .unwrap();
        }
    }
}

JSON parser with borrowed values

The previous parser owned its data by allocating strings. This can require quite some memory, and using borrowed string slices can help save space, while potentially also improving performance.

If you are familiar with Rust's concept of lifetimes, using &str string slices instead of owned String is straightforward:

@ 33c29
- enum Token {
+ enum Token<'source> {
@ 62,63c58,59
-     #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
-     String(String),
+     #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
+     String(&'source str),
@ 70c66
- enum Value {
+ enum Value<'source> {
@ 78c74
-     String(String),
+     String(&'source str),
@ 80c76
-     Array(Vec<Value>),
+     Array(Vec<Value<'source>>),
@ 82c78
-     Object(HashMap<String, Value>),
+     Object(HashMap<&'source str, Value<'source>>),
@ 88c84
- fn parse_value<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_value<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
@ 113c109
- fn parse_array<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_array<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
@ 167c163
- fn parse_object<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_object<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {

The above code shows the lines you need to change from the previous example to use borrowed data.

Finally, we provide the full code, which you should be able to run with1:

cargo run --example json-borrowed examples/example.json

1 You first need to clone this repository.

use logos::{Lexer, Logos, Span};

use std::collections::HashMap;
use std::env;
use std::fs;

type Error = (String, Span);

type Result<T> = std::result::Result<T, Error>;

/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token<'source> {
    #[token("false", |_| false)]
    #[token("true", |_| true)]
    Bool(bool),

    #[token("{")]
    BraceOpen,

    #[token("}")]
    BraceClose,

    #[token("[")]
    BracketOpen,

    #[token("]")]
    BracketClose,

    #[token(":")]
    Colon,

    #[token(",")]
    Comma,

    #[token("null")]
    Null,

    #[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
    Number(f64),

    #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
    String(&'source str),
}

/// Represent any valid JSON value.
#[derive(Debug)]
enum Value<'source> {
    /// null.
    Null,
    /// true or false.
    Bool(bool),
    /// Any floating point number.
    Number(f64),
    /// Any quoted string.
    String(&'source str),
    /// An array of values
    Array(Vec<Value<'source>>),
    /// An dictionary mapping keys and values.
    Object(HashMap<&'source str, Value<'source>>),
}

/// Parse a token stream into a JSON value.
fn parse_value<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
    if let Some(token) = lexer.next() {
        match token {
            Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
            Ok(Token::BraceOpen) => parse_object(lexer),
            Ok(Token::BracketOpen) => parse_array(lexer),
            Ok(Token::Null) => Ok(Value::Null),
            Ok(Token::Number(n)) => Ok(Value::Number(n)),
            Ok(Token::String(s)) => Ok(Value::String(s)),
            _ => Err((
                "unexpected token here (context: value)".to_owned(),
                lexer.span(),
            )),
        }
    } else {
        Err(("empty values are not allowed".to_owned(), lexer.span()))
    }
}

/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
    let mut array = Vec::new();
    let span = lexer.span();
    let mut awaits_comma = false;
    let mut awaits_value = false;

    while let Some(token) = lexer.next() {
        match token {
            Ok(Token::Bool(b)) if !awaits_comma => {
                array.push(Value::Bool(b));
                awaits_value = false;
            }
            Ok(Token::BraceOpen) if !awaits_comma => {
                let object = parse_object(lexer)?;
                array.push(object);
                awaits_value = false;
            }
            Ok(Token::BracketOpen) if !awaits_comma => {
                let sub_array = parse_array(lexer)?;
                array.push(sub_array);
                awaits_value = false;
            }
            Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
            Ok(Token::Comma) if awaits_comma => awaits_value = true,
            Ok(Token::Null) if !awaits_comma => {
                array.push(Value::Null);
                awaits_value = false;
            }
            Ok(Token::Number(n)) if !awaits_comma => {
                array.push(Value::Number(n));
                awaits_value = false;
            }
            Ok(Token::String(s)) if !awaits_comma => {
                array.push(Value::String(s));
                awaits_value = false;
            }
            _ => {
                return Err((
                    "unexpected token here (context: array)".to_owned(),
                    lexer.span(),
                ))
            }
        }
        awaits_comma = !awaits_value;
    }
    Err(("unmatched opening bracket defined here".to_owned(), span))
}

/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
    let mut map = HashMap::new();
    let span = lexer.span();
    let mut awaits_comma = false;
    let mut awaits_key = false;

    while let Some(token) = lexer.next() {
        match token {
            Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
            Ok(Token::Comma) if awaits_comma => awaits_key = true,
            Ok(Token::String(key)) if !awaits_comma => {
                match lexer.next() {
                    Some(Ok(Token::Colon)) => (),
                    _ => {
                        return Err((
                            "unexpected token here, expecting ':'".to_owned(),
                            lexer.span(),
                        ))
                    }
                }
                let value = parse_value(lexer)?;
                map.insert(key, value);
                awaits_key = false;
            }
            _ => {
                return Err((
                    "unexpected token here (context: object)".to_owned(),
                    lexer.span(),
                ))
            }
        }
        awaits_comma = !awaits_key;
    }
    Err(("unmatched opening brace defined here".to_owned(), span))
}

fn main() {
    let filename = env::args().nth(1).expect("Expected file argument");
    let src = fs::read_to_string(&filename).expect("Failed to read file");

    let mut lexer = Token::lexer(src.as_str());

    match parse_value(&mut lexer) {
        Ok(value) => println!("{:#?}", value),
        Err((msg, span)) => {
            use ariadne::{ColorGenerator, Label, Report, ReportKind, Source};

            let mut colors = ColorGenerator::new();

            let a = colors.next();

            Report::build(ReportKind::Error, &filename, 12)
                .with_message("Invalid JSON".to_string())
                .with_label(
                    Label::new((&filename, span))
                        .with_message(msg)
                        .with_color(a),
                )
                .finish()
                .eprint((&filename, Source::from(src)))
                .unwrap();
        }
    }
}
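
As a quick sanity check, you can also drive the same parser on an inline string instead of a file. The sketch below is not part of the example file; it only relies on the Token enum and parse_value function defined above, and the parse_inline name is made up for illustration:

fn parse_inline() {
    // Borrowed string values inside the resulting `Value` point into `source`.
    let source = r#"{"name": "logos", "fast": true, "versions": [1, 2, 3]}"#;
    let mut lexer = Token::lexer(source);

    match parse_value(&mut lexer) {
        Ok(value) => println!("{:#?}", value),
        Err((msg, span)) => eprintln!("error at {:?}: {}", span, msg),
    }
}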

Contributing

If you are considering contributing to Logos, then this place is for you!

First, we really appreciate people who help this project grow, and we would like to guide you through the standard contribution process.

There are many ways to help us, and here is a short list of some of them:

  • fixing a bug by providing a patch (or suggesting in the comments how one could fix it);
  • correcting some typos in the documentation, the book, or anywhere else;
  • raising an issue about a problem (i.e., opening an issue on GitHub);
  • proposing new features (either with an issue or a pull request on GitHub);
  • or improving the documentation (either in the crate or in the book).

In any case, GitHub is the place to go for anything related to contributing.

Below, we provide a few help pages (or links) with content that can help you understand Logos' internals and how you can create and submit a contribution.

Setup

On this page, you will find all the information needed to run and test your own version of the Logos crate locally.

We assume you have basic knowledge of git and GitHub. If that is not the case, please refer to the link mentioned in Contributing.

Prerequisites

You need to have both git and Rust installed on your computer; see their respective installation procedures.

Once that's done, clone the Logos repository to your computer:

git clone https://github.com/maciejhirsz/logos.git

If you have a fork of this repository, make sure to clone it instead.
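
For example, assuming your fork lives under your own GitHub account, the clone command would point at your copy instead (the username below is a placeholder):

git clone https://github.com/<your-username>/logos.git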

Finally, launch a terminal (i.e., command-line) session and go to the logos directory.

Checking the code compiles

A good way to see if your code compiles is to use the eponymous command:

cargo check --workspace

Formatting and linting your code

Prior to suggesting changes in a pull request, it is important to both format your code:

cargo fmt

and check against Rust's linter:

cargo clippy

Make sure to run both frequently; otherwise, your pull request will probably fail the automated checks.

Testing your code

Code that compiles isn't necessarily correct, and testing it against known cases is good practice:

cargo test --workspace

You can also run benchmarks:

cargo bench --workspace --benches
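
If you want a single routine to run before opening a pull request, the following sequence simply chains the commands above (a suggested checklist, not an official script):

cargo fmt
cargo clippy
cargo check --workspace
cargo test --workspace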

Building the documentation

Logos' documentation needs to be built with Rust's nightly toolchain.

You can install the latest nightly channel with:

rustup install nightly

Then, use the following command to build the documentation with a similar configuration to the one used by docs.rs:

RUSTDOCFLAGS="--cfg docsrs" cargo +nightly doc \
    --features debug \
    -Zunstable-options \
    -Zrustdoc-scrape-examples \
    --no-deps \
    --open

Building the book

Logos' book can be built with mdBook.

This tool can be installed with cargo:

cargo install mdbook

You also need to install mdbook-admonish and its assets:

cargo install mdbook-admonish
cd book/  # You must run the next command from the book/ directory
mdbook-admonish install

Then, you can build the book with:

mdbook serve book --open

Any change in the ./book folder will automatically trigger a new build, and the pages will be live-reloaded.

Internals

Logos' core functionality is split across five crates:

  • logos is the main crate, which you add to your project (in Cargo.toml) to obtain the Logos derive macro. The public API is limited to this crate, and most users should only use this crate, not the others.
  • logos-derive is a very simple but necessary crate that exposes logos-codegen's code as a derive macro.
  • logos-codegen contains the most technical parts of Logos: the code that reads your token definitions and generates optimized code to create blazingly fast lexers. You can read a blog post from the author of Logos to get a small insight into what the logos-codegen crate does. In the future, we hope to provide more documents about how this crate works, so people are more likely to understand it and improve it with pull requests (see the Contributing section).
  • logos-cli is a separate crate that installs a binary of the same name and allows you to expand the Logos derive macro into code. It can be installed with cargo install logos-cli, and usage help can be obtained through the logos-cli --help command. This tool can be useful if your token definitions are constant and you want to reduce the compilation-time overhead caused by derive macros.
  • logos-fuzz is an internal crate that uses afl.rs to find confusing panics before they reach the developer. To use this tool, see the Fuzzing guide.

Fuzzing

Fuzzing is a technique to test a piece of software by injecting randomly generated inputs. This can be pretty useful to discover bugs, as pointed out in #407.

Logos' fuzzing crate is powered by afl.rs, which finds panics in Logos' methods.

Usage

First, make sure you have cargo-afl installed, see the rust-fuzz afl setup guide for installation information.

Next, change your current working directory to the fuzz folder.

Building

Before fuzzing, you need to build the target with:

cargo afl build

Fuzzy testing

The recommended way to run tests is with:

cargo afl fuzz -i in -o out ../target/debug/logos-fuzz

Note that it may run for a (very) long time before it encounters any bug.

Replaying a Crash

If you happen to find a bug that crashes the program, you can replay it with:

cargo afl run logos-fuzz < out/default/crashes/crash_file

Reporting a Bug

If you encounter a crash and you feel the error message is not appropriate, please report it by opening an issue. Don't forget to include your crash file so we can later reproduce it.