Unicode support

By default, Logos is Unicode-aware. It accepts input as a Rust `&str`, which is always valid UTF-8, and it compiles its regular expressions to match Unicode code points. The spans it returns for tokens are guaranteed not to split a multi-byte UTF-8 sequence. Each of these behaviors can be changed, however.
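To make the span guarantee concrete, here is a minimal sketch that uses only the Rust standard library, not Logos itself. The helper names `slice_span` and `span_is_on_char_boundaries` are hypothetical and exist only for illustration. They show what it means for a byte span to respect code point boundaries in a `&str`.

```rust
/// Hypothetical helper: returns the text covered by the byte span
/// `start..end`, or `None` if the span is out of range or would cut
/// through a multi-byte UTF-8 sequence.
fn slice_span(input: &str, start: usize, end: usize) -> Option<&str> {
    input.get(start..end)
}

/// Hypothetical helper: returns true if both ends of the span fall on
/// code point boundaries of `input`.
fn span_is_on_char_boundaries(input: &str, start: usize, end: usize) -> bool {
    start <= end && input.is_char_boundary(start) && input.is_char_boundary(end)
}
```

In `"héllo"` the `é` occupies bytes 1 and 2, so the span `0..3` is on boundaries while `0..2` is not, and `slice_span("héllo", 0, 2)` returns `None`. Spans that respect boundaries can always be sliced back out of the input safely.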

Using `&[u8]` input

The easiest thing to change is the input type. Adding the `#[logos(utf8 = false)]` attribute to your token enum instructs Logos to accept a byte slice as input instead. By itself, this does not change matching behavior at all: the regular expressions are still compiled with Unicode support, so `.` still matches a single character rather than a single byte, and so on. If you add only that attribute and call the lexer with `Token::lexer(input.as_bytes())`, you get exactly the same output as before.
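As a rough analogy for why byte input alone changes nothing, the sketch below uses only the standard library and does not call Logos. The hypothetical function `first_code_point_len` reports how many bytes a Unicode-aware "match one character" step would consume at the start of a byte slice.

```rust
/// Hypothetical helper: length in bytes of the first Unicode code point in
/// `input`, or `None` if the slice is empty or does not start with valid
/// UTF-8.
fn first_code_point_len(input: &[u8]) -> Option<usize> {
    // Decode as much of the slice as is valid UTF-8.
    let valid = match std::str::from_utf8(input) {
        Ok(s) => s,
        Err(e) => std::str::from_utf8(&input[..e.valid_up_to()]).ok()?,
    };
    valid.chars().next().map(char::len_utf8)
}
```

For `"é!".as_bytes()` this returns `Some(2)`: the step still consumes the whole two-byte character even though the input is a `&[u8]`. That mirrors how a Unicode-mode pattern behaves regardless of the input type.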

Matching bytes rather than Unicode codepoints

If you want to ignore Unicode altogether and match ASCII, raw bytes, or some other character encoding, you can compile your regular expressions with Unicode mode off. You can do this either by clearing the Unicode flag manually with `(?-u)` in the regular expression, or by supplying the pattern as a byte string, as in `#[regex(b"my.*pattern")]`, in which case Logos turns the flag off for you. See the regex docs for more information.

Logos automatically detects whether any of your patterns can match a byte sequence that is not valid UTF-8. If such a pattern exists and you have not set the lexer to use `&[u8]` input, Logos issues a compile error.
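The reason for this check can be sketched with the standard library alone. The hypothetical helper `token_fits_in_str` below is not part of Logos. It asks whether a matched byte sequence could ever be handed back as part of a `&str`.

```rust
/// Hypothetical helper: returns true if `matched` is valid UTF-8 and could
/// therefore appear as a slice of a `&str` input.
fn token_fits_in_str(matched: &[u8]) -> bool {
    std::str::from_utf8(matched).is_ok()
}
```

A byte-oriented pattern can match a lone `0xFF` or only the first byte of `é` (`0xC3`), and for both `token_fits_in_str` returns `false`. Such a match has no valid representation in a `&str`, so a lexer over `&str` input cannot soundly use a pattern that permits it.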
