pest: parsing in Rust
A Microsoft engineer introduced me to pest
as a way to add service filtering to a ZeroConf plugin that I’m prototyping for Akri. It’s been fun to learn but I worry that, because I won’t use it frequently, I’ll quickly forget what I’ve done. So, here are my notes.
Here’s the problem: I’d like to provide users of the ZeroConf plugin with a string-based filter that lets them filter the services discovered when the Akri agent browses a network.
I originally used zeroconf
but am now exploring astro-dnssd
for ZeroConf browsing. Here’s a list of results from the former:
{
    name: "freddie",
    kind: "_rust._tcp",
    domain: "local",
    host_name: "freddie.local",
    address: "192.168.100.15",
    port: 8080,
    txt: Some(AvahiTxtRecord(UnsafeCell))
}
And then browsing using Avahi:
avahi-browse --all
+ enp5s0 IPv6 freddie _rust._tcp local
And so, a filter may contain some subset of the above, e.g.:
name="freddie" domain="local" kind="_rust._tcp" port="8080"
For now, I’m focusing on constant values but, in practice, this should support wildcards e.g. name="fred*"
too.
I’m using 2 crates: pest
and pest_derive
. I started writing code but the pest.rs
site has an in-browser editor which is excellent.
NOTE The “share” button is not working on the editor.
The documentation is decent but special thanks to @rtyler on Gitter for pointing me to the pest_derive
documentation which helped me greatly.
I may have over-complicated my grammar but here is the current version (feedback always welcome):
./src/zeroconf.pest
:
// ZeroConf: Service Filtering
name = { ASCII_ALPHANUMERIC ~ ( ( ASCII_ALPHANUMERIC ~ HYPHEN ~ ASCII_ALPHANUMERIC ) | ASCII_ALPHANUMERIC )* }
full_name = _{ "name=\"" ~ name ~ "\"" }
domain = { ASCII_ALPHA_LOWER+ }
full_domain = _{ "domain=\"" ~ domain ~ "\"" }
host_name = { name ~ "." ~ domain }
full_host_name = _{ "host_name=\"" ~ host_name ~ "\"" }
tcp = { "tcp" }
udp = { "udp" }
sctp = { "sctp" }
protocol = { tcp | udp | sctp }
full_protocol = _{ "_" ~ protocol }
stype = { ASCII_ALPHA_LOWER+ }
full_stype = _{ "_" ~ stype }
kind = { full_stype ~ "." ~ full_protocol }
full_kind = _{ "kind=\"" ~ kind ~ "\"" }
port = { ASCII_DIGIT{1,5} }
full_port = _{ "port=\"" ~ port ~ "\"" }
term = _{ full_kind | full_domain | full_name | full_port | full_host_name }
filter = { term ~ ( SPACE_SEPARATOR* ~ term )* }
I won’t cover each rule but I’ll note some learnings:
- DNS names, represented here by name, are limited in length (not expressed by the rule) and are alphanumeric but may include hyphens. Implicitly, hyphens may not be repeated (i.e. no --) and must occur between alphanumeric characters.
- The terms in filters all take the form identifier="value" but, once the identifier is matched, e.g. domain="...", the identifier can be discarded. pest determines that domain="..." matches Rule::full_domain and from this we can grab Rule::domain, which is what we really want.
- Here’s a neat feature of pest called Silent modifiers (link). By prefixing the definition of rule full_domain with an underscore, full_domain = _{ ... }, the rule is used but the result tree drops references to Silent rules, in this case full_domain. I’ll provide an example below.
- The rule for port is inexact. This corresponds to TCP ports, which are unsigned 16-bit integers (0:65535). The rule permits 77777, which isn’t a valid port, although it does prohibit 123456, which isn’t valid either. I think the range-checking needs to be done in code (see the sketch after this list).
- The rule for filter defines repeated terms but it does not preclude repeating the same term. This could result in a filter kind="_rust._tcp" kind="_http._tcp", which is contradictory. I feel this is my inexperience with the grammar. I’ll have to deal with this in code too (also covered in the sketch below).
- The rule for filter also permits repeated spaces between terms, so kind="_rust._tcp"   domain="local" (with multiple spaces) is permitted.
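Since the grammar can’t easily express the port range or forbid duplicate terms, here’s a minimal sketch of how I imagine handling both in code after a successful parse. The validate_filter function and its error messages are hypothetical, not code from the plugin:
use std::collections::HashSet;

use pest::iterators::Pairs;

// Hypothetical post-parse validation: the grammar alone can't check the
// port range or reject a duplicate term, so walk the parsed pairs here.
fn validate_filter(parsed: Pairs<Rule>) -> Result<(), String> {
    let mut seen = HashSet::new();
    for filter in parsed {
        for term in filter.into_inner() {
            let rule = term.as_rule();
            // Reject a repeated term such as kind="..." kind="..."
            if !seen.insert(rule) {
                return Err(format!("duplicate term: {:?}", rule));
            }
            // The grammar permits any 1-5 digit port (e.g. 77777);
            // parsing to u16 enforces the 0:65535 range.
            if rule == Rule::port {
                term.as_str()
                    .parse::<u16>()
                    .map_err(|_| format!("invalid port: {}", term.as_str()))?;
            }
        }
    }
    Ok(())
}
It would take the output of ZeroConfParser::parse(Rule::filter, ...) before the terms are acted upon: walk the pairs once, remember which rules have already been seen, and let the u16 type do the range-checking.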
To derive a parser, we just need to reference zeroconf.pest
:
use pest::Parser;
use pest_derive::*;
#[derive(Parser)]
#[grammar = "zeroconf.pest"]
pub struct ZeroConfParser;
And then, per the documentation, we could:
let example = "name=\"hades-canyon\" domain=\"local\" kind=\"_rust._tcp\" port=\"8080\" host_name=\"hades-canyon.local\"";
let filter = ZeroConfParser::parse(Rule::filter, example);
println!("{:?}", filter);
But, I’ve chosen to build tests for each of the rules, i.e.:
#[cfg(test)]
mod tests {
    use super::ZeroConfParser;
    use super::*;

    use lazy_static::*;
    use pest::{consumes_to, parses_to};

    const NAME: &str = "freddie";
    const TCP: &str = "tcp";
    const RUST: &str = "rust";
    const DOMAIN: &str = "local";
    const PORT: &str = "8080";

    lazy_static! {
        static ref FULL_NAME: String = format!("name=\"{}\"", NAME);
        static ref KIND: String = format!("_{}._{}", RUST, TCP);
        static ref FULL_KIND: String = format!("kind=\"{}\"", *KIND);
        static ref FULL_DOMAIN: String = format!("domain=\"{}\"", DOMAIN);
    }

    #[test]
    fn test_filter_domain_kind_name() {
        let filter: String = format!("{} {} {}", *FULL_DOMAIN, *FULL_KIND, *FULL_NAME);
        println!("{:?} [{}]", filter, filter.len());
        parses_to! {
            parser:ZeroConfParser,
            input:&filter,
            rule:Rule::filter,
            // 000000000011111111112222222222333333333344444444
            // 012345678901234567890123456789012345678901234567
            // domain="local" kind="_rust._tcp" name="freddie"
            tokens:[
                filter(0,47,[
                    domain(8,13),
                    kind(21,31,[
                        stype(22,26),
                        protocol(28,31,[
                            tcp(28,31)
                        ])
                    ]),
                    name(39,46)
                ])
            ]
        };
    }
}
So, how does this work? We leverage the parses_to! macro, give it the parser, some input (filter) and then a set of tokens. The tokens are the hard part. Please see the comment string above the tokens; it helps work out what’s where and why. The filter rule yields a filter that begins at 0 and ends at 47. The end position is exclusive: the last character of the filter is at 46 but the filter ends at 47.
If we return to the rules, the filter rule is defined filter = { term ~ ( SPACE_SEPARATOR* ~ term )* }. So, we should get at least one term and possibly an arbitrarily long list of space-separated terms. If you look at the code below, you’ll see that its tokens include filter, which in turn includes 3 terms: term(0,14,...), term(15,32,...) and term(33,47,...). But this is just complexity; what we really want to know is that these 3 terms are: domain(...), kind(...) and name(...). Adding the silent modifier to the definition of term is what produces the flatter result shown in the test above.
However, this adds some complexity of its own: note, for example, domain(8,13). This is because the silent rule matched the whole of domain="local" but we’ve silently dropped domain=" (don’t forget the ") and the terminating ". So, instead of the dropped term(0,14,...) corresponding to domain="local", we only have domain(8,13) corresponding to local. Nice!!
I think a further optimisation may be to drop protocol = { tcp | udp | sctp } and have just protocol = { "tcp" | "udp" | "sctp" } but I’m ignoring that for now.
This test begins to show the benefit of using the Silent modifier on some rules. Below is the same test as it would look if the rules didn’t include the silent modifier (and the regular code would need to navigate this entire hierarchy too):
#[test]
fn test_filter_domain_kind_name() {
    let filter: String = format!("{} {} {}", *FULL_DOMAIN, *FULL_KIND, *FULL_NAME);
    println!("{:?} [{}]", filter, filter.len());
    parses_to! {
        parser:ZeroConfParser,
        input:&filter,
        rule:Rule::filter,
        // 000000000011111111112222222222333333333344444444
        // 012345678901234567890123456789012345678901234567
        // domain="local" kind="_rust._tcp" name="freddie"
        tokens:[
            filter(0,47,[
                term(0,14,[
                    full_domain(0,14,[
                        domain(8,13)
                    ])
                ]),
                term(15,32,[
                    full_kind(15,32,[
                        kind(21,31,[
                            full_stype(21,26,[
                                stype(22,26)
                            ]),
                            full_protocol(27,31,[
                                protocol(28,31,[
                                    tcp(28,31)
                                ])
                            ])
                        ])
                    ])
                ]),
                term(33,47,[
                    full_name(33,47,[
                        name(39,46)
                    ])
                ])
            ])
        ]
    };
}
One additional advantage of the silent modifier is that you need fewer tests: you only need to test e.g. domain, not both full_domain and domain. You can imagine how this ripples through the grammar; it not only saves time but also makes the code easier to “parse” as a human too. See what I did there!?
Ok, the next step is to begin to do something with the parser beyond just testing it. Once again, the documentation provides an example. Remember, what we’d like is to give the parser some text and get back a set of terms. We can do this:
fn main() {
    let example = "name=\"freddie\" domain=\"local\" kind=\"_rust._tcp\" port=\"8080\" host_name=\"freddie.local\"";
    let filter = ZeroConfParser::parse(Rule::filter, example).unwrap_or_else(|e| panic!("{}", e));
    for terms in filter {
        for term in terms.into_inner() {
            match term.as_rule() {
                Rule::name => println!("Name: {}", term.as_str()),
                Rule::kind => {
                    println!("Kind: {}", term.as_str());
                    for i3p in term.into_inner() {
                        match i3p.as_rule() {
                            Rule::stype => {
                                println!("Type: {}", i3p.as_str())
                            }
                            Rule::protocol => {
                                println!("Protocol: {}", i3p.as_str())
                            }
                            _ => unreachable!(),
                        }
                    }
                }
                Rule::domain => println!("Domain: {}", term.as_str()),
                Rule::port => println!("Port: {}", term.as_str()),
                Rule::host_name => println!("Hostname: {}", term.as_str()),
                _ => unreachable!(),
            }
        }
    }
}
And:
cargo run
Yields:
Name: freddie
Domain: local
Kind: _rust._tcp
Type: rust
Protocol: tcp
Port: 8080
Hostname: freddie.local
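Printing is just a demonstration, though; what I’ll eventually want is a value that the discovery code can match services against. Here’s a rough sketch of that direction. The ServiceFilter struct, its field names and the to_filter function are my own invention for illustration, not part of Akri or the plugin:
// A rough sketch: gather the parsed terms into a struct that discovery
// code could match services against (all names here are invented).
#[derive(Debug, Default)]
struct ServiceFilter {
    name: Option<String>,
    kind: Option<String>,
    domain: Option<String>,
    host_name: Option<String>,
    port: Option<u16>,
}

fn to_filter(input: &str) -> Result<ServiceFilter, String> {
    let parsed = ZeroConfParser::parse(Rule::filter, input).map_err(|e| e.to_string())?;
    let mut result = ServiceFilter::default();
    for filter in parsed {
        for term in filter.into_inner() {
            match term.as_rule() {
                Rule::name => result.name = Some(term.as_str().to_string()),
                Rule::kind => result.kind = Some(term.as_str().to_string()),
                Rule::domain => result.domain = Some(term.as_str().to_string()),
                Rule::host_name => result.host_name = Some(term.as_str().to_string()),
                // Range-check the port here: the grammar alone allows e.g. 77777
                Rule::port => {
                    result.port = Some(
                        term.as_str()
                            .parse::<u16>()
                            .map_err(|_| format!("invalid port: {}", term.as_str()))?,
                    )
                }
                _ => unreachable!(),
            }
        }
    }
    Ok(result)
}
It reuses the same match over Rule as main above and folds in the port range-check mentioned earlier.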
That’s all (for now).