Packrat Parsing from Scratch

https://blog.bruce-hill.com/packrat-parsing-from-scratch

Packrat
Parsing from Scratch

How to implement a packrat parser from scratch, one easy piece at a time.

Estimated reading time: 12 mins

June 5, 2021

Packrat parsing is a relatively new approach to parsing code introduced by Bryan Ford in his 2002 Master’s thesis. Packrat parsers are capable of very efficiently parsing a class of grammars called Parsing Expression Grammars (PEGs). I covered PEGs in more depth in my previous post, PEGs and the Structure of Languages, so if you’re new to the topic, I recommend you begin there. The packrat parser is based on earlier work on recursive descent parsers, dating back to the 70s, but focused specifically on Parsing Expression Grammars. In this post, I’ll attempt to demystify packrat parsers by walking through a complete packrat parser implementation that fits in a few dozen lines of Javascript code.

The core idea of the packrat parsing algorithm is incredibly simple. It boils down to: when you attempt to match a rule at a particular position, remember whether it succeeds, and if so, how much input it consumes. Because PEGs have no context-dependencies, if a pattern matches in a particular place in a string, it will always match there, exactly the same, regardless of where it fits into a larger structure. This means there’s a fixed number of combinations of grammar rules to match and string positions to match them at. There’s no need to try matching the same pattern in different ways at the same position. Here, you can see what the parsing algorithm looks like in action (click “Run” to run the demo):

Making a Packrat Parser

The architecture I’m going to use is a functional style that represents grammar rules as functions that will either match an input string or not, and return some information if it does match. This makes the implementation easy, but a more efficient approach might be to use a virtual machine. We’re going to start with a simple container class that stores info about the match:

Here, start is inclusive and end is exclusive, i.e. start is the index where the match begins, and end is the index where subsequent matches would begin. A zero-length match would have start == end.

class Match {
  constructor(text, start, end, children=[]) {
    this.text = text;
    this.start = start;
    this.end = end;
    this.children = children;
  }
}

Pattern Building Blocks

Next, let’s begin implementing support for matching different pattern types. The first and simplest pattern is one that matches literal text. The literal pattern will check whether the exact string match occurs at the starting string position:

In Javascript, functions implicitly return undefined if they run to the end without an explicit return statement, so that will represent “no match.”

function literal(pattern) {
  return ((text, i)=> {
    if (text.startsWith(pattern, i))
      return new Match(text, i, i+pattern.length);
  });
}

Using this, we can perform simple string matching:

var HelloPattern = literal("hello");
HelloPattern("hello world", 0); // -> new Match(text, 0, 5))
HelloPattern("goodbye", 0); // -> no match (undefined)
HelloPattern("hello world", 5); // -> no match at index 5

Great, that works! Let’s get to the meatier patterns. The next pattern is a chain or sequence of other patterns one after the other:

function chain(...patterns) {
  return ((text, i)=> {
    let children = [], start = i;
    for (let pattern of patterns) {
      let result = pattern(text, i);
      if (!result) return;
      children.push(result);
      i = result.end;
    }
    return new Match(text, start, i, children);
  });
}

We can use it like this:

var HelloWorldPattern = chain(
  literal("hello"), literal(" "), literal("world"));
HelloWorldPattern("hello world", 0); // -> new Match(text, 0, 11))
HelloWorldPattern("goodbye", 0); // -> undefined

Now, let’s get to the first really interesting pattern: a branching choice option. This will walk through a list of patterns one at a time, and return the results of the first one that succeeds:

function oneof(...patterns) {
  return ((text, i)=> {
    for (let pattern of patterns) {
      let result = pattern(text, i);
      if (result) return result;
    }
  });
}

We can use it like this:

var AddressWorldPattern = chain(
  oneof(literal("hello"), literal("goodbye")),
  literal(" "), literal("world"));
AddressWorldPattern("hello world", 0); // -> new Match(text, 0, 11))
AddressWorldPattern("goodbye world", 0); // -> new Match(text, 0, 13)
AddressWorldPattern("hello sky", 0); // -> undefined

At this point, we need a way to match repeating patterns. There’s two ways to do that: looping and recursion. I’m going to choose recursion here because it will make it easier to have a fine-grained cache in the future (this will have one recursive function call per iteration). The approach for recursion is to recursively define “0 or more repetitions of a pattern” to mean: either the pattern followed by 0 or more repetitions of the pattern, or the empty string (in PEG syntax: repeating <- (pat repeating) / ""). Because PEGs formally specify parsing order, the empty string will only match if “one or more repetitions” doesn’t match:

The code here uses a lazy lookup of zero_or_more using a function closure. This is a little hacky, but later, we will implement a less hacky way to do this.

function repeat(pattern) {
  let zero_or_more = oneof(
      chain(pattern, (text,i)=>zero_or_more(text,i)),
      literal(""));
  return zero_or_more
}

These four primitive pattern types (literals, chains, options, and repeating) are incredibly powerful, and you can define a huge range of PEG grammars in with only these tools, but there’s one thing missing: negation. Negation will allow us to specify that something shouldn’t match a pattern (e.g. variables can’t be keywords). A successful negation will return a zero-length result that means “the pattern I’m negating doesn’t match at my starting position:”

function not(pattern) {
  return ((text, i) => {
    if (!pattern(text, i))
      return new Match(text,i,i);
  });
}

Now, the last component here isn’t strictly necessary, but it’s very handy to have a wildcard pattern that matches any single character, the equivalent of . in regular expressions. You could achieve the same goal by creating a oneof() that has a literal() corresponding to every character, but that’s just tedious.

function anychar() {
  return ((text, i)=> {
    if (i < text.length)
      return new Match(text, i, i+1);
  });
}

A Simple CSV Parser

With these functions, we can implement a parser for a full language! Let’s try it out with a CSV parser (comma separated values) to see what it will look like:

// Equivalent to the regex: [^,\n]*
let CSVItem = repeat(chain(not(","), not("\n"), anychar()));
// Sorta like: CSVItem ("," CSVItem)*
let CSVLine = chain(CSVItem, repeat(chain(",", CSVItem)));
// Sorta like: CSVLine ("\n" CSVLine)*
let CSVFile = chain(CSVLine, repeat(chain("\n", CSVLine)));

// It works (but only for validation)
let my_file = "x,y,z\na,b,c";
Console.assert(CSVFile(my_file)) // new Match(...)

Capturing Data

Right now, the parsing framework is very flexible, but it’s not very useful because the return value it produces is just a bunch of string indices. Let’s add the ability to explicitly capture snippets of the source text and the ability to tag a match with a descriptive name.

function capture(pattern) {
  return (text, i)=> {
    let result = pattern(text, i);
    if (result) {
      let match = new Match(
        text, result.start, result.end, [result]);
      match.captured = text.substring(i, result.end);
      return match;
    }
  };
}

function named(name, pattern) {
  return (text, i)=> {
    let result = pattern(text, i);
    if (result) {
      let match = new Match(text, i, result.end, [result]);
      match.name = name;
      return match;
    }
  };
}

Now we can define our earlier grammar with captures:

let CSVItem = named("Item",
  capture(repeat(chain(not(literal(','),
    not(literal("\n")), anychar()))));
let CSVLine = named("Line",
  chain(CSVItem, repeat(chain(literal(','), CSVItem))))
let CSVFile = named("File",
  chain(CSVLine, repeat(chain(literal("\n"), CSVLine))))

This finally gets us to a usable result! Here’s a program that uses our CSV parser to print the sum of the numbers on each row.

let my_file = "1,2,3\n4,5,6"
let csv_data = CSVFile(my_file):

// Extract the captured values from the tree structure
// of chain()s and oneof()s and so forth:
function* get_captures(n) {
  if (n.captured) yield n.captured;
  for (let child of n.children) {
    yield from get_captures(child);
  }
}

// Print the sum of the numbers on each row:
for (let row of csv_data.children) {
  let total = 0;
  for (let item_text of get_captures(row)) {
    total += parseInt(item_text);
  }
  console.log(total); // First logs 6, then 15
}

Recursion

Earlier, when we defined the repeat rule, we had to use a somewhat hacky version of recursion that relied on local variables and function closures. It worked, but it wasn’t clean. The way to solve this more elegantly is to use lazy rule lookups on a single grammar object:

class Grammar {
  ref(name) {
    return (text, i) => this[name](text, i);
  }
}

Using this object, we can implement recursive (and corecursive) definitions by using grammar.ref() instead of referencing the rule directly, and the lookup will be delayed until after the rule is defined:

let g = new Grammar();
g.A = chain("A", oneof(g.ref("A"), g.ref("B"), literal("")));
g.B = chain("B", oneof(g.ref("B"), g.ref("A"), literal("")));
g.AB = oneof(g.A, g.B);
Console.assert(g.AB("ABBBAAABA", 0));

Some Conveniences

There’s also a few other helper functions that provide some features you might expect or desire to have, like lookaheads (&patt), optionals (patt?), character ranges ([a-z]), character sets ([aeiou]), and inverted character sets ([^,;]). All of these can be implemented in terms of the previously defined functions, but they’re common enough that it makes sense to make slightly more optimized versions of them and provide an API to access them easily.

// Equivalent to not(not(pattern))
function ahead(pattern) {
  return (text,i)=> (pattern(text, i) && new Match(text, i,i));
}

// Equivalent to oneof(pattern, literal(""))
function maybe(pattern) {
  return (text,i)=> (pattern(text, i) || new Match(text, i,i));
}

// Equivalent to oneof(literal(min), ..., literal(max))
function between(min, max) {
  return ((text, i)=> {
    if (i < text.length && min <= text[i] && text[i] <= max)
      return new Match(text, i, i+1);
  });
}

// Equivalent to oneof(literal(allowed[0], allowed[1], ...))
// or chain(not(oneof(...)), anychar()) when inclusive=false
function charfrom(allowed, inclusive=true) {
  if (!inclusive)
    return chain(not(charfrom(allowed)), anychar());
  return oneof(...allowed);
}

All right! That’s it. In total, it’s about 100 lines of code for a full parsing library that can parse any arbitrary PEG. Unfortunately, it’s kinda slow because it’s not a proper packrat parser. The defining feature of packrat parsers is that they memoize everything, which is what gives them their amazing speed.

Packrat Parsing

We can make this code orders of magnitude faster with 10 lines of code, and one tiny tweak. All it takes adding caching to the definitions:

function cached(fn) {
  let cache = {}, cache_text = "";
  return (text, i)=> {
    if (text != cache_text) // cache is per-input-text
      cache = {}, cache_text = text;
    if (!(i in cache))
      cache[i] = fn(text, i);
    return cache[i];
  }
}

And then, wrap all the returned functions with cached(). For example, with literal():

function literal(pattern) {
  return cached((text, i)=> {
    if (text.startsWith(pattern, i))
      return new Match(text, i, i+pattern.length);
  });
}

That’s it! This is a complete packrat parser! As you’ve seen, the fundamental components of a packrat parser are individually quite simple, and there aren’t very many of them. At its heart, a packrat parser is nothing more than a memoized matcher of literals, chains, choices, negations, and recursive references. There’s nothing that’s fundamentally difficult to understand, just a simple idea that can be implemented in a few dozen lines of code and endlessly tweaked and improved on.

Future Improvements

The parser that I’ve walked you through implementing in this post is asymptotically quite fast. It runs in O(GN)O(G N) time, where GG is the size of the grammar and NN is the size of input text; in other words, for a fixed grammar, you will get θ(N)θ(N) performance for arbitrary input text. It will also use O(GN)O(G N) memory space (θ(N)θ(N) for a fixed grammar). For comparison, packrat parsers are capable of matching recursive grammars that are impossible to match with extended regular expressions, but can do so in linear time, while extended regular expressions have potentially unbounded runtime. The downside is that packrat parsers use an amount of memory proportional to the length of the input text. Packrat parser runtime may be close to θ(N)θ(N) for arbitrary real-world (non-malicious) grammars, since grammars tend to follow power-law distributions, where a few rules are called often, but the rest are only rarely called. When this is the case, the expected number of rules tested at each string index is bounded by a constant number that doesn’t grow as the grammar adds more rules. However, the implementation here has a lot of constant-time overhead on common operations and it can potentially use quite a lot of stack space. Using a virtual machine with a heap-allocated stack instead of recursive function calls will lower function call overhead and prevent stack overflow, but the overall performance would need to be tuned using a profiler to find performance bottlenecks if you truly cared about speed. In other words, this blog post is intended to give you an understanding of the concepts behind a packrat parser, but the code here won’t outperform a real parsing library.

Besides performance, the programming ergonomics could also use a lot of improvement. The code repeat(charfrom('aeiou'))(input_text, 0) is much more cumbersome than regex pattern matching: input_text.match(/[aeiou]*/). Fortunately, the regex/BNF-like PEG grammar discussed in PEGs and the structure of languages is itself a grammar that can be parsed by a packrat parser. In my next post, I’ll discuss how to write a program that automatically converts PEG-syntax grammars into runnable parsers–in other words, how to make a parser-generator. As a quick teaser, here is how we might start building it:

let g = new Grammar();
g.AnyChar = named("AnyChar", literal("."));
g.CharSet = named("CharSet", chain(literal("["),
  repeat(not(literal("]")), capture(anychar())),
  literal("]")));
// ...
g.Expression = oneof(
  g.ref("AnyChar"), g.ref("CharSet"), ...);

function make_parser(grammar_text) {
  let match = g.Expression(grammar_text, 0);
  if (!match) throw "Grammar failed to compile";
  switch (match.name) {
    case "AnyChar": return anychar();
    case "CharSet": return charfrom(get_captures(match));
    //...
  }
}
function match(grammar, text, offset=0) {
  return make_parser(grammar)(text, offset)
}

let result = match("[aeiou]", input_text);

Look forward to a new post coming soon!

{
"by": "sph",
"descendants": 0,
"id": 40246751,
"score": 2,
"time": 1714738128,
"title": "Packrat Parsing from Scratch",
"type": "story",
"url": "https://blog.bruce-hill.com/packrat-parsing-from-scratch"
}
{
"author": "Bruce Hill",
"date": "2021-06-05T12:00:00.000Z",
"description": "How to implement a packrat parser from\nscratch, one easy piece at a time.",
"image": "https://blog.bruce-hill.com/media/packrat-parsing/parse-tree.png",
"logo": "https://logo.clearbit.com/bruce-hill.com",
"publisher": "Bruce Hill",
"title": "Packrat Parsing from Scratch",
"url": "https://blog.bruce-hill.com/packrat-parsing-from-scratch"
}
{
"url": "https://blog.bruce-hill.com/packrat-parsing-from-scratch",
"title": "Packrat Parsing from Scratch",
"description": "How to implement a packrat parser from scratch, one easy piece at a time. Estimated reading time: 12 mins June 5, 2021 Packrat parsing is a relatively new approach to parsing code introduced by Bryan Ford in...",
"links": [
"https://blog.bruce-hill.com/packrat-parsing-from-scratch"
],
"image": "https://blog.bruce-hill.com/media/packrat-parsing/parse-tree.png",
"content": "<div>\n<article>\n<p><img src=\"https://blog.bruce-hill.com/media/packrat-parsing/parse-tree.png\" alt=\"Packrat\nParsing from Scratch\" />\n</p>\n<p>How to implement a packrat parser from scratch, one easy\npiece at a time.</p>\n<p>Estimated reading time: 12 mins</p>\n<span>June 5, 2021</span>\n<p>Packrat parsing is a relatively new approach to parsing code introduced by\nBryan Ford in <a href=\"https://pdos.csail.mit.edu/~baford/packrat/thesis/\" target=\"_blank\">his 2002 Master’s thesis.</a> Packrat parsers\nare capable of very efficiently parsing a class of grammars called Parsing\nExpression Grammars (PEGs). I covered PEGs in more depth in my previous post, <a target=\"_blank\" href=\"https://blog.bruce-hill.com/pegs-and-the-structure-of-languages\">PEGs and the Structure of\nLanguages</a>, so if you’re new to the topic, I recommend you begin there. The\npackrat parser is based on earlier work on recursive descent parsers, dating\nback to the 70s, but focused specifically on Parsing Expression Grammars. In\nthis post, I’ll attempt to demystify packrat parsers by walking through a\ncomplete packrat parser implementation that fits in a few dozen lines of\nJavascript code.</p>\n<p>The core idea of the packrat parsing algorithm is incredibly simple. It boils\ndown to: when you attempt to match a rule at a particular position, remember\nwhether it succeeds, and if so, how much input it consumes. Because PEGs have no\ncontext-dependencies, if a pattern matches in a particular place in a string, it\nwill <em>always</em> match there, exactly the same, regardless of where it fits\ninto a larger structure. This means there’s a fixed number of combinations of\ngrammar rules to match and string positions to match them at. There’s no need to\ntry matching the same pattern in different ways at the same position. Here, you\ncan see what the parsing algorithm looks like in action (click “Run” to run the\ndemo):</p>\n<h2 id=\"making-a-packrat-parser\">Making a Packrat Parser</h2>\n<p>The architecture I’m going to use is a functional style that represents\ngrammar rules as functions that will either match an input string or not, and\nreturn some information if it does match. This makes the implementation easy,\nbut a more efficient approach might be to use a virtual machine. We’re going to\nstart with a simple container class that stores info about the match:</p>\n<p><span>Here, <code>start</code> is <em>inclusive</em> and\n<code>end</code> is <em>exclusive</em>, i.e. <code>start</code> is the index\nwhere the match begins, and <code>end</code> is the index where subsequent\nmatches would begin. A zero-length match would have\n<code>start == end</code>.</span></p>\n<div><pre><code><span><span>class</span> Match {</span>\n<span> <span>constructor</span>(text<span>,</span> start<span>,</span> end<span>,</span> children<span>=</span>[]) {</span>\n<span> <span>this</span><span>.</span><span>text</span> <span>=</span> text<span>;</span></span>\n<span> <span>this</span><span>.</span><span>start</span> <span>=</span> start<span>;</span></span>\n<span> <span>this</span><span>.</span><span>end</span> <span>=</span> end<span>;</span></span>\n<span> <span>this</span><span>.</span><span>children</span> <span>=</span> children<span>;</span></span>\n<span> }</span>\n<span>}</span></code></pre></div>\n<h2 id=\"pattern-building-blocks\">Pattern Building Blocks</h2>\n<p>Next, let’s begin implementing support for matching different pattern types.\nThe first and simplest pattern is one that matches literal text. The\n<code>literal</code> pattern will check whether the exact string match occurs at\nthe starting string position:</p>\n<p><span>In Javascript, functions implicitly return\n<code>undefined</code> if they run to the end without an explicit\n<code>return</code> statement, so that will represent “no match.”</span></p>\n<div><pre><code><span><span>function</span> <span>literal</span>(pattern) {</span>\n<span> <span>return</span> ((text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>if</span> (text<span>.</span><span>startsWith</span>(pattern<span>,</span> i))</span>\n<span> <span>return</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span> i<span>+</span>pattern<span>.</span><span>length</span>)<span>;</span></span>\n<span> })<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>Using this, we can perform simple string matching:</p>\n<div><pre><code><span><span>var</span> HelloPattern <span>=</span> <span>literal</span>(<span>\"hello\"</span>)<span>;</span></span>\n<span><span>HelloPattern</span>(<span>\"hello world\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; new Match(text, 0, 5))</span></span>\n<span><span>HelloPattern</span>(<span>\"goodbye\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; no match (undefined)</span></span>\n<span><span>HelloPattern</span>(<span>\"hello world\"</span><span>,</span> <span>5</span>)<span>;</span> <span>// -&gt; no match at index 5</span></span></code></pre></div>\n<p>Great, that works! Let’s get to the meatier patterns. The next pattern is a\nchain or sequence of other patterns one after the other:</p>\n<div><pre><code><span><span>function</span> <span>chain</span>(<span>...</span>patterns) {</span>\n<span> <span>return</span> ((text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>let</span> children <span>=</span> []<span>,</span> start <span>=</span> i<span>;</span></span>\n<span> <span>for</span> (<span>let</span> pattern <span>of</span> patterns) {</span>\n<span> <span>let</span> result <span>=</span> <span>pattern</span>(text<span>,</span> i)<span>;</span></span>\n<span> <span>if</span> (<span>!</span>result) <span>return</span><span>;</span></span>\n<span> children<span>.</span><span>push</span>(result)<span>;</span></span>\n<span> i <span>=</span> result<span>.</span><span>end</span><span>;</span></span>\n<span> }</span>\n<span> <span>return</span> <span>new</span> <span>Match</span>(text<span>,</span> start<span>,</span> i<span>,</span> children)<span>;</span></span>\n<span> })<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>We can use it like this:</p>\n<div><pre><code><span><span>var</span> HelloWorldPattern <span>=</span> <span>chain</span>(</span>\n<span> <span>literal</span>(<span>\"hello\"</span>)<span>,</span> <span>literal</span>(<span>\" \"</span>)<span>,</span> <span>literal</span>(<span>\"world\"</span>))<span>;</span></span>\n<span><span>HelloWorldPattern</span>(<span>\"hello world\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; new Match(text, 0, 11))</span></span>\n<span><span>HelloWorldPattern</span>(<span>\"goodbye\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; undefined</span></span></code></pre></div>\n<p>Now, let’s get to the first really interesting pattern: a branching choice\noption. This will walk through a list of patterns one at a time, and return the\nresults of the first one that succeeds:</p>\n<div><pre><code><span><span>function</span> <span>oneof</span>(<span>...</span>patterns) {</span>\n<span> <span>return</span> ((text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>for</span> (<span>let</span> pattern <span>of</span> patterns) {</span>\n<span> <span>let</span> result <span>=</span> <span>pattern</span>(text<span>,</span> i)<span>;</span></span>\n<span> <span>if</span> (result) <span>return</span> result<span>;</span></span>\n<span> }</span>\n<span> })<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>We can use it like this:</p>\n<div><pre><code><span><span>var</span> AddressWorldPattern <span>=</span> <span>chain</span>(</span>\n<span> <span>oneof</span>(<span>literal</span>(<span>\"hello\"</span>)<span>,</span> <span>literal</span>(<span>\"goodbye\"</span>))<span>,</span></span>\n<span> <span>literal</span>(<span>\" \"</span>)<span>,</span> <span>literal</span>(<span>\"world\"</span>))<span>;</span></span>\n<span><span>AddressWorldPattern</span>(<span>\"hello world\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; new Match(text, 0, 11))</span></span>\n<span><span>AddressWorldPattern</span>(<span>\"goodbye world\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; new Match(text, 0, 13)</span></span>\n<span><span>AddressWorldPattern</span>(<span>\"hello sky\"</span><span>,</span> <span>0</span>)<span>;</span> <span>// -&gt; undefined</span></span></code></pre></div>\n<p>At this point, we need a way to match repeating patterns. There’s two ways to\ndo that: looping and recursion. I’m going to choose recursion here because it\nwill make it easier to have a fine-grained cache in the future (this will have\none recursive function call per iteration). The approach for recursion is to\nrecursively define “0 or more repetitions of a pattern” to mean: either the\npattern followed by 0 or more repetitions of the pattern, or the empty string\n(in PEG syntax: <code>repeating <span>&lt;-</span> (pat repeating) <span>/</span> <span>\"\"</span></code>).\nBecause PEGs formally specify parsing order, the empty string will only match if\n“one or more repetitions” doesn’t match:</p>\n<p><span>The code here uses a lazy lookup of\n<code>zero_or_more</code> using a function closure. This is a little hacky, but\nlater, we will implement a less hacky way to do this.</span></p>\n<div><pre><code><span><span>function</span> <span>repeat</span>(pattern) {</span>\n<span> <span>let</span> zero_or_more <span>=</span> <span>oneof</span>(</span>\n<span> <span>chain</span>(pattern<span>,</span> (text<span>,</span>i)<span>=&gt;</span><span>zero_or_more</span>(text<span>,</span>i))<span>,</span></span>\n<span> <span>literal</span>(<span>\"\"</span>))<span>;</span></span>\n<span> <span>return</span> zero_or_more</span>\n<span>}</span></code></pre></div>\n<p>These four primitive pattern types (literals, chains, options, and repeating)\nare incredibly powerful, and you can define a huge range of PEG grammars in with\nonly these tools, but there’s one thing missing: negation. Negation will allow\nus to specify that something <em>shouldn’t</em> match a pattern (e.g. variables\ncan’t be keywords). A successful negation will return a zero-length result that\nmeans “the pattern I’m negating doesn’t match at my starting position:”</p>\n<div><pre><code><span><span>function</span> <span>not</span>(pattern) {</span>\n<span> <span>return</span> ((text<span>,</span> i) <span>=&gt;</span> {</span>\n<span> <span>if</span> (<span>!</span><span>pattern</span>(text<span>,</span> i))</span>\n<span> <span>return</span> <span>new</span> <span>Match</span>(text<span>,</span>i<span>,</span>i)<span>;</span></span>\n<span> })<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>Now, the last component here isn’t strictly necessary, but it’s very handy to\nhave a wildcard pattern that matches any single character, the equivalent of\n<code>.</code> in regular expressions. You could achieve the same goal by\ncreating a <code>oneof()</code> that has a <code>literal()</code> corresponding\nto every character, but that’s just tedious.</p>\n<div><pre><code><span><span>function</span> <span>anychar</span>() {</span>\n<span> <span>return</span> ((text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>if</span> (i <span>&lt;</span> text<span>.</span><span>length</span>)</span>\n<span> <span>return</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span> i<span>+</span><span>1</span>)<span>;</span></span>\n<span> })<span>;</span></span>\n<span>}</span></code></pre></div>\n<h2 id=\"a-simple-csv-parser\">A Simple CSV Parser</h2>\n<p>With these functions, we can implement a parser for a full language! Let’s\ntry it out with a CSV parser (comma separated values) to see what it will look\nlike:</p>\n<div><pre><code><span><span>// Equivalent to the regex: [^,\\n]*</span></span>\n<span><span>let</span> CSVItem <span>=</span> <span>repeat</span>(<span>chain</span>(<span>not</span>(<span>\",\"</span>)<span>,</span> <span>not</span>(<span>\"</span><span>\\n</span><span>\"</span>)<span>,</span> <span>anychar</span>()))<span>;</span></span>\n<span><span>// Sorta like: CSVItem (\",\" CSVItem)*</span></span>\n<span><span>let</span> CSVLine <span>=</span> <span>chain</span>(CSVItem<span>,</span> <span>repeat</span>(<span>chain</span>(<span>\",\"</span><span>,</span> CSVItem)))<span>;</span></span>\n<span><span>// Sorta like: CSVLine (\"\\n\" CSVLine)*</span></span>\n<span><span>let</span> CSVFile <span>=</span> <span>chain</span>(CSVLine<span>,</span> <span>repeat</span>(<span>chain</span>(<span>\"</span><span>\\n</span><span>\"</span><span>,</span> CSVLine)))<span>;</span></span>\n<span></span>\n<span><span>// It works (but only for validation)</span></span>\n<span><span>let</span> my_file <span>=</span> <span>\"x,y,z</span><span>\\n</span><span>a,b,c\"</span><span>;</span></span>\n<span><span>Console</span><span>.</span><span>assert</span>(<span>CSVFile</span>(my_file)) <span>// new Match(...)</span></span></code></pre></div>\n<h2 id=\"capturing-data\">Capturing Data</h2>\n<p>Right now, the parsing framework is very flexible, but it’s not very\n<em>useful</em> because the return value it produces is just a bunch of string\nindices. Let’s add the ability to explicitly capture snippets of the source text\nand the ability to tag a match with a descriptive name.</p>\n<div><pre><code><span><span>function</span> <span>capture</span>(pattern) {</span>\n<span> <span>return</span> (text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>let</span> result <span>=</span> <span>pattern</span>(text<span>,</span> i)<span>;</span></span>\n<span> <span>if</span> (result) {</span>\n<span> <span>let</span> match <span>=</span> <span>new</span> <span>Match</span>(</span>\n<span> text<span>,</span> result<span>.</span><span>start</span><span>,</span> result<span>.</span><span>end</span><span>,</span> [result])<span>;</span></span>\n<span> match<span>.</span><span>captured</span> <span>=</span> text<span>.</span><span>substring</span>(i<span>,</span> result<span>.</span><span>end</span>)<span>;</span></span>\n<span> <span>return</span> match<span>;</span></span>\n<span> }</span>\n<span> }<span>;</span></span>\n<span>}</span>\n<span></span>\n<span><span>function</span> <span>named</span>(name<span>,</span> pattern) {</span>\n<span> <span>return</span> (text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>let</span> result <span>=</span> <span>pattern</span>(text<span>,</span> i)<span>;</span></span>\n<span> <span>if</span> (result) {</span>\n<span> <span>let</span> match <span>=</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span> result<span>.</span><span>end</span><span>,</span> [result])<span>;</span></span>\n<span> match<span>.</span><span>name</span> <span>=</span> name<span>;</span></span>\n<span> <span>return</span> match<span>;</span></span>\n<span> }</span>\n<span> }<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>Now we can define our earlier grammar with captures:</p>\n<div><pre><code><span><span>let</span> CSVItem <span>=</span> <span>named</span>(<span>\"Item\"</span><span>,</span></span>\n<span> <span>capture</span>(<span>repeat</span>(<span>chain</span>(<span>not</span>(<span>literal</span>(<span>','</span>)<span>,</span></span>\n<span> <span>not</span>(<span>literal</span>(<span>\"</span><span>\\n</span><span>\"</span>))<span>,</span> <span>anychar</span>()))))<span>;</span></span>\n<span><span>let</span> CSVLine <span>=</span> <span>named</span>(<span>\"Line\"</span><span>,</span></span>\n<span> <span>chain</span>(CSVItem<span>,</span> <span>repeat</span>(<span>chain</span>(<span>literal</span>(<span>','</span>)<span>,</span> CSVItem))))</span>\n<span><span>let</span> CSVFile <span>=</span> <span>named</span>(<span>\"File\"</span><span>,</span></span>\n<span> <span>chain</span>(CSVLine<span>,</span> <span>repeat</span>(<span>chain</span>(<span>literal</span>(<span>\"</span><span>\\n</span><span>\"</span>)<span>,</span> CSVLine))))</span></code></pre></div>\n<p>This finally gets us to a usable result! Here’s a program that uses our CSV\nparser to print the sum of the numbers on each row.</p>\n<div><pre><code><span><span>let</span> my_file <span>=</span> <span>\"1,2,3</span><span>\\n</span><span>4,5,6\"</span></span>\n<span><span>let</span> csv_data <span>=</span> <span>CSVFile</span>(my_file)<span>:</span></span>\n<span></span>\n<span><span>// Extract the captured values from the tree structure</span></span>\n<span><span>// of chain()s and oneof()s and so forth:</span></span>\n<span><span>function</span><span>*</span> <span>get_captures</span>(n) {</span>\n<span> <span>if</span> (n<span>.</span><span>captured</span>) <span>yield</span> n<span>.</span><span>captured</span><span>;</span></span>\n<span> <span>for</span> (<span>let</span> child <span>of</span> n<span>.</span><span>children</span>) {</span>\n<span> <span>yield</span> <span>from</span> <span>get_captures</span>(child)<span>;</span></span>\n<span> }</span>\n<span>}</span>\n<span></span>\n<span><span>// Print the sum of the numbers on each row:</span></span>\n<span><span>for</span> (<span>let</span> row <span>of</span> csv_data<span>.</span><span>children</span>) {</span>\n<span> <span>let</span> total <span>=</span> <span>0</span><span>;</span></span>\n<span> <span>for</span> (<span>let</span> item_text <span>of</span> <span>get_captures</span>(row)) {</span>\n<span> total <span>+=</span> <span>parseInt</span>(item_text)<span>;</span></span>\n<span> }</span>\n<span> <span>console</span><span>.</span><span>log</span>(total)<span>;</span> <span>// First logs 6, then 15</span></span>\n<span>}</span></code></pre></div>\n<h2 id=\"recursion\">Recursion</h2>\n<p>Earlier, when we defined the <code>repeat</code> rule, we had to use a\nsomewhat hacky version of recursion that relied on local variables and function\nclosures. It worked, but it wasn’t clean. The way to solve this more elegantly\nis to use lazy rule lookups on a single grammar object:</p>\n<div><pre><code><span><span>class</span> Grammar {</span>\n<span> <span>ref</span>(name) {</span>\n<span> <span>return</span> (text<span>,</span> i) <span>=&gt;</span> <span>this</span>[name](text<span>,</span> i)<span>;</span></span>\n<span> }</span>\n<span>}</span></code></pre></div>\n<p>Using this object, we can implement recursive (and corecursive) definitions\nby using <code>grammar.ref()</code> instead of referencing the rule directly,\nand the lookup will be delayed until after the rule is defined:</p>\n<div><pre><code><span><span>let</span> g <span>=</span> <span>new</span> <span>Grammar</span>()<span>;</span></span>\n<span>g<span>.</span><span>A</span> <span>=</span> <span>chain</span>(<span>\"A\"</span><span>,</span> <span>oneof</span>(g<span>.</span><span>ref</span>(<span>\"A\"</span>)<span>,</span> g<span>.</span><span>ref</span>(<span>\"B\"</span>)<span>,</span> <span>literal</span>(<span>\"\"</span>)))<span>;</span></span>\n<span>g<span>.</span><span>B</span> <span>=</span> <span>chain</span>(<span>\"B\"</span><span>,</span> <span>oneof</span>(g<span>.</span><span>ref</span>(<span>\"B\"</span>)<span>,</span> g<span>.</span><span>ref</span>(<span>\"A\"</span>)<span>,</span> <span>literal</span>(<span>\"\"</span>)))<span>;</span></span>\n<span>g<span>.</span><span>AB</span> <span>=</span> <span>oneof</span>(g<span>.</span><span>A</span><span>,</span> g<span>.</span><span>B</span>)<span>;</span></span>\n<span><span>Console</span><span>.</span><span>assert</span>(g<span>.</span><span>AB</span>(<span>\"ABBBAAABA\"</span><span>,</span> <span>0</span>))<span>;</span></span></code></pre></div>\n<h2 id=\"some-conveniences\">Some Conveniences</h2>\n<p>There’s also a few other helper functions that provide some features you\nmight expect or desire to have, like lookaheads (<code><span>&amp;</span>patt</code>), optionals\n(<code>patt<span>?</span></code>), character\nranges (<code><span>[a-z]</span></code>),\ncharacter sets (<code><span>[aeiou]</span></code>), and inverted\ncharacter sets (<code><span>[^,;]</span></code>). All of these can\nbe implemented in terms of the previously defined functions, but they’re common\nenough that it makes sense to make slightly more optimized versions of them and\nprovide an API to access them easily.</p>\n<div><pre><code><span><span>// Equivalent to not(not(pattern))</span></span>\n<span><span>function</span> <span>ahead</span>(pattern) {</span>\n<span> <span>return</span> (text<span>,</span>i)<span>=&gt;</span> (<span>pattern</span>(text<span>,</span> i) <span>&amp;&amp;</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span>i))<span>;</span></span>\n<span>}</span>\n<span></span>\n<span><span>// Equivalent to oneof(pattern, literal(\"\"))</span></span>\n<span><span>function</span> <span>maybe</span>(pattern) {</span>\n<span> <span>return</span> (text<span>,</span>i)<span>=&gt;</span> (<span>pattern</span>(text<span>,</span> i) <span>||</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span>i))<span>;</span></span>\n<span>}</span>\n<span></span>\n<span><span>// Equivalent to oneof(literal(min), ..., literal(max))</span></span>\n<span><span>function</span> <span>between</span>(min<span>,</span> max) {</span>\n<span> <span>return</span> ((text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>if</span> (i <span>&lt;</span> text<span>.</span><span>length</span> <span>&amp;&amp;</span> min <span>&lt;=</span> text[i] <span>&amp;&amp;</span> text[i] <span>&lt;=</span> max)</span>\n<span> <span>return</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span> i<span>+</span><span>1</span>)<span>;</span></span>\n<span> })<span>;</span></span>\n<span>}</span>\n<span></span>\n<span><span>// Equivalent to oneof(literal(allowed[0], allowed[1], ...))</span></span>\n<span><span>// or chain(not(oneof(...)), anychar()) when inclusive=false</span></span>\n<span><span>function</span> <span>charfrom</span>(allowed<span>,</span> inclusive<span>=</span><span>true</span>) {</span>\n<span> <span>if</span> (<span>!</span>inclusive)</span>\n<span> <span>return</span> <span>chain</span>(<span>not</span>(<span>charfrom</span>(allowed))<span>,</span> <span>anychar</span>())<span>;</span></span>\n<span> <span>return</span> <span>oneof</span>(<span>...</span>allowed)<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>All right! That’s it. In total, it’s about 100 lines of code for a full\nparsing library that can parse any arbitrary PEG. Unfortunately, it’s kinda slow\nbecause it’s not a proper packrat parser. The defining feature of packrat\nparsers is that they <em>memoize</em> everything, which is what gives them their\namazing speed.</p>\n<h2 id=\"packrat-parsing\">Packrat Parsing</h2>\n<p>We can make this code orders of magnitude faster with 10 lines of code, and\none tiny tweak. All it takes adding caching to the definitions:</p>\n<div><pre><code><span><span>function</span> <span>cached</span>(fn) {</span>\n<span> <span>let</span> cache <span>=</span> {}<span>,</span> cache_text <span>=</span> <span>\"\"</span><span>;</span></span>\n<span> <span>return</span> (text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>if</span> (text <span>!=</span> cache_text) <span>// cache is per-input-text</span></span>\n<span> cache <span>=</span> {}<span>,</span> cache_text <span>=</span> text<span>;</span></span>\n<span> <span>if</span> (<span>!</span>(i <span>in</span> cache))</span>\n<span> cache[i] <span>=</span> <span>fn</span>(text<span>,</span> i)<span>;</span></span>\n<span> <span>return</span> cache[i]<span>;</span></span>\n<span> }</span>\n<span>}</span></code></pre></div>\n<p>And then, wrap all the returned functions with <code><span>cached</span>()</code>. For\nexample, with <code><span>literal</span>()</code>:</p>\n<div><pre><code><span><span>function</span> <span>literal</span>(pattern) {</span>\n<span> <span>return</span> <span>cached</span>((text<span>,</span> i)<span>=&gt;</span> {</span>\n<span> <span>if</span> (text<span>.</span><span>startsWith</span>(pattern<span>,</span> i))</span>\n<span> <span>return</span> <span>new</span> <span>Match</span>(text<span>,</span> i<span>,</span> i<span>+</span>pattern<span>.</span><span>length</span>)<span>;</span></span>\n<span> })<span>;</span></span>\n<span>}</span></code></pre></div>\n<p>That’s it! This is a complete packrat parser! As you’ve seen, the fundamental\ncomponents of a packrat parser are individually quite simple, and there aren’t\nvery many of them. At its heart, a packrat parser is nothing more than a\nmemoized matcher of literals, chains, choices, negations, and recursive\nreferences. There’s nothing that’s fundamentally difficult to understand, just a\nsimple idea that can be implemented in a few dozen lines of code and endlessly\ntweaked and improved on.</p>\n<h2 id=\"future-improvements\">Future Improvements</h2>\n<p>The parser that I’ve walked you through implementing in this post is\nasymptotically quite fast. It runs in\nO(GN)O(G N)\ntime, where\nGG\nis the size of the grammar and\nNN\nis the size of input text; in other words, for a fixed grammar, you will get\nθ(N)θ(N)\nperformance for arbitrary input text. It will also use\nO(GN)O(G N)\nmemory space\n(θ(N)θ(N)\nfor a fixed grammar). For comparison, packrat parsers are capable of matching\nrecursive grammars that are <a href=\"https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags\" target=\"_blank\"><em>impossible</em> to match with extended\nregular expressions,</a> but can do so in <em>linear time,</em> while extended\nregular expressions have potentially unbounded runtime. The downside is that\npackrat parsers use an amount of memory proportional to the length of the input\ntext. <span>Packrat parser runtime may be close to\nθ(N)θ(N)\nfor arbitrary real-world (non-malicious) grammars, since grammars tend to follow\npower-law distributions, where a few rules are called often, but the rest are\nonly rarely called. When this is the case, the expected number of rules tested\nat each string index is bounded by a constant number that doesn’t grow as the\ngrammar adds more rules.</span> However, the implementation here has a lot of\nconstant-time overhead on common operations and it can potentially use quite a\nlot of stack space. Using a virtual machine with a heap-allocated stack instead\nof recursive function calls will lower function call overhead and prevent stack\noverflow, but the overall performance would need to be tuned using a profiler to\nfind performance bottlenecks if you truly cared about speed. In other words,\nthis blog post is intended to give you an understanding of the concepts behind a\npackrat parser, but the code here won’t outperform a real parsing library.</p>\n<p>Besides performance, the programming ergonomics could also use a lot of\nimprovement. The code <code><span>repeat</span>(<span>charfrom</span>(<span>'aeiou'</span>))(input_text<span>,</span> <span>0</span>)</code>\nis much more cumbersome than regex pattern matching: <code>input_text<span>.</span><span>match</span>(<span>/</span><span>[aeiou]*</span><span>/</span>)</code>.\nFortunately, the regex/BNF-like PEG grammar discussed in <a target=\"_blank\" href=\"https://blog.bruce-hill.com/pegs-and-the-structure-of-languages\">PEGs and the structure of\nlanguages</a> is <em>itself</em> a grammar that can be parsed by a packrat\nparser. In my next post, I’ll discuss how to write a program that automatically\nconverts PEG-syntax grammars into runnable parsers–in other words, how to make a\nparser-generator. As a quick teaser, here is how we might start building it:</p>\n<div><pre><code><span><span>let</span> g <span>=</span> <span>new</span> <span>Grammar</span>()<span>;</span></span>\n<span>g<span>.</span><span>AnyChar</span> <span>=</span> <span>named</span>(<span>\"AnyChar\"</span><span>,</span> <span>literal</span>(<span>\".\"</span>))<span>;</span></span>\n<span>g<span>.</span><span>CharSet</span> <span>=</span> <span>named</span>(<span>\"CharSet\"</span><span>,</span> <span>chain</span>(<span>literal</span>(<span>\"[\"</span>)<span>,</span></span>\n<span> <span>repeat</span>(<span>not</span>(<span>literal</span>(<span>\"]\"</span>))<span>,</span> <span>capture</span>(<span>anychar</span>()))<span>,</span></span>\n<span> <span>literal</span>(<span>\"]\"</span>)))<span>;</span></span>\n<span><span>// ...</span></span>\n<span>g<span>.</span><span>Expression</span> <span>=</span> <span>oneof</span>(</span>\n<span> g<span>.</span><span>ref</span>(<span>\"AnyChar\"</span>)<span>,</span> g<span>.</span><span>ref</span>(<span>\"CharSet\"</span>)<span>,</span> <span>...</span>)<span>;</span></span>\n<span></span>\n<span><span>function</span> <span>make_parser</span>(grammar_text) {</span>\n<span> <span>let</span> match <span>=</span> g<span>.</span><span>Expression</span>(grammar_text<span>,</span> <span>0</span>)<span>;</span></span>\n<span> <span>if</span> (<span>!</span>match) <span>throw</span> <span>\"Grammar failed to compile\"</span><span>;</span></span>\n<span> <span>switch</span> (match<span>.</span><span>name</span>) {</span>\n<span> <span>case</span> <span>\"AnyChar\"</span><span>:</span> <span>return</span> <span>anychar</span>()<span>;</span></span>\n<span> <span>case</span> <span>\"CharSet\"</span><span>:</span> <span>return</span> <span>charfrom</span>(<span>get_captures</span>(match))<span>;</span></span>\n<span> <span>//...</span></span>\n<span> }</span>\n<span>}</span>\n<span><span>function</span> <span>match</span>(grammar<span>,</span> text<span>,</span> offset<span>=</span><span>0</span>) {</span>\n<span> <span>return</span> <span>make_parser</span>(grammar)(text<span>,</span> offset)</span>\n<span>}</span>\n<span></span>\n<span><span>let</span> result <span>=</span> <span>match</span>(<span>\"[aeiou]\"</span><span>,</span> input_text)<span>;</span></span></code></pre></div>\n<p>Look forward to a new post coming soon!</p>\n</article>\n</div>",
"author": "@spillybones",
"favicon": "",
"source": "blog.bruce-hill.com",
"published": "2021-06-05",
"ttr": 460,
"type": "website"
}