Piccolo – A Stackless Lua Interpreter
2024-05-01
- History of piccolo
- A "Stackless" Interpreter Design
- Benefits of Stackless
- The "Big Lie"
- Rust Coroutines, Lua Coroutines, and Snarfing
- Zooming Out
piccolo is an interpreter for the Lua
language written in pure, mostly safe Rust with an eye towards safe sandboxing
and resiliency. It uses gc-arena for its
garbage collection system, and in fact gc-arena
was originally created as part
of piccolo
.
You can try it out below in an the interactive REPL. I'm not much of a web
developer, and this is a little homegrown web terminal thingy, so hopefully this
works for you. I'm going to be using REPLs like this to demonstrate a lot of
what makes piccolo
unique, so if it doesn't work or you or you don't have
Javascript turned on, then this might be a pretty boring post and I'm sorry!
↓ Type Some Lua Here ↓[1]You know, if you want to...
I find REPLs to be really magical and inviting,[2] and I end up eventually wanting to attach them to everything I ever make.[3]It's possible to like things too much. Not just a REPL but the idea that your program is a sort of Living Thing that understands a Language, and if the normal UI isn't fit for purpose and you're enough of a Witch, you can always just Speak to the program in the way it naturally understands... cast a Spell to achieve your Goals, or maybe just have a Conversation. I think it helps close the gap between the author of a program and the user. I'm not better than the user, who am I to tell them what they can and can't say to the program?
I hope you feel the same way about REPLs as I do because there are a lot of them on this page, and I'm going to ask you to type things into them. If you're not into it, well... I'll try and always give you working code that you can copy and paste, but where's the fun in that?
I said in my last post in this series
that my goal wasn't to try to sell anyone on gc-arena
or piccolo
[4]yet anyway
and I think that's still true here. piccolo
is pretty
rough around the edges[5]which you probably noticed if you tried to
use large parts of the stdlib in the REPL above
right now, but it's
more complete than you might think[6]The implementation strategy so
far has been to do the hardest parts first to prove that the basic idea works,
so most of really hard parts of the VM are feature complete.
and it
has some interesting and unique features. Still, I'm not telling you to go out
and replace LuaJIT or Luau or
The Ur Lua with piccolo
just yet.
In this post, I just want to talk about what makes piccolo
special, and
hopefully you'll find it interesting. In a future post, I'm going to tie all of
this together, the way gc-arena
and piccolo
are designed, how they work now,
and how I wish they could work in the future, but this post is going to focus
just on piccolo
as it works right now.
History of piccolo
When I was first writing piccolo
(in 2019), I had noticed that nobody had
quite figured out how to make VMs for certain kinds of languages that could be
competitive with C while being implemented primarily in safe Rust. Mostly I'm
referring to problems surrounding garbage collection, rather than something like
designing fast interpreter loops (which is something I'm not very good at yet!).
Languages that have ownership semantics similar to Rust are of course much
easier to write VMs for in Rust, because the implemented language can
snarf[7]my absolute favorite PLT jargon
much of the
functionality from the host language. It's pretty easy to express the exact
semantics of Rc
by just... using Rc
itself to implement the language. There
are several such scripting languages that try to have matching ownership and
mutability semantics with Rust and I think that's honestly a great idea because
sharing these core semantics with the host language just removes a huge amount
of cognitive burden when crossing the language boundary, and you can make an
embedded language this way that feels like it fits in perfectly with the host.
However, I also think it's a bit of a disappointment if only Rust-like
languages can be easily made using Rust. Certainly this is not actually true,
and there are plenty of other Rust runtimes for languages with "unrestricted
ownership" (proper garbage collection, unrestricted references... the terms for
this are a bit all over the place, but what I mean is languages like Lua).
At the time at least, when I surveyed language runtimes written in Rust they
broadly fell into one of several categories, none of which was what I wanted
for piccolo
...
- Languages with ownership semantics similar to Rust, so no "true garbage collection" / "unrestricted ownership" or whatever you want to call it (dyon, rune, etc...)
- Language runtimes with true garbage collection (tracing or generational collector) but whose implementations were wildly, infectiously unsafe as they would be in C, due to the nature of garbage collected pointers and their interactions with the Rust stack.
- Language runtimes for languages that are meant to have proper garbage
collection but the implementer used
Rc
or similar and left the problem of what to do about reference cycles for later (RustPython).
I wanted to have a language for Rust that felt as natural as Lua does for C, and one that had true garbage collection[8]and I have extremely bad NIH syndrome ... and frankly I really like just plain vanilla Lua. I think it matches perfectly with Rust because they're so different, I think having an Rust embedded language that frees you from even having to think about ownership is very powerful because it can be used for things where having to think about ownership can be more trouble than its worth. Let each language play to their strengths, and Rust and Lua in a lot of ways have complementary strengths.
Since I just looooove Lua so much and I had so much experience with vanilla
PUC-Rio Lua (aka The Ur Lua), I decided to try and write an
interpreter designed similarly to[9]shamelessly stolen from
PUC-Rio's Lua, with a similar sort of garbage collector, but because I was
primarily interested in sandboxing untrusted scripts, somehow made of mostly
safe Rust.[10]or at least not like
this
gc-arena
was born out of my efforts to solve this problem.
But I really don't intend to throw any shade at any of the projects I listed
above or any other Rust-implemented language runtime written in a different
style. These are hard problems and gc-arena
is not a perfect solution.
In fact, in the early days of gc-arena
and piccolo
, I ran into so many
seemingly unsolvable problems that I became hopelessly frustrated and gave up on
piccolo
entirely for about four years.
It was only through Ruffle's use of
gc-arena
and Ruffle contributors helping to get through the roadblocks we
encountered that I was eventually able to pick piccolo
back up. Today, there
are not nearly so many unsolved problems in trying to use gc-arena
to
implement a language like Lua or ActionScript, but it really came down
to Ruffle
contributors helping to solve each issue one by one over the
intervening years.
BUT, even with all of the major roadblocks overcome (pointers to Unsize
types,
GC finalization, non-'static Any
, a bunch more I've forgotten) influence from
the biggest limitation of gc-arena
stubbornly remained: the "Mutation XOR
Collection" design of gc-arena
that was talked about in the
last post. gc-arena
's design
requires that code that wants to access garbage collected data do so through
special mutation methods, and that collection must ONLY happen when no
mutation methods are executing.
This "Mutation XOR Collection" design means that calls to Arena::mutate
must return before garbage collection can safely take place, to prove that
no garbage collected pointers exist anywhere on the Rust stack. Unless I were
willing to just give up on ever hoping to match Lua runtimes written in C, or
were willing to limit the places piccolo
could be used,[11]If I were
making a runtime that was more limited in purpose, I could instead limit garbage
collection to, say, once a frame if I were making a video game or something like
Ruffle, or just simply require that continuous execution of Lua not go on
for "too long", but this would make piccolo
much less general.
I had to figure out a way to make the entire execution context of piccolo
suspendable, just to be able to leave calls to Arena::mutate
, so that
garbage collection could take place at arbitrary points.
At the beginning, this limitation infuriated me, and I spent ages trying
anything I could to find an acceptable way around it. It still certainly
is limiting, but now that piccolo
has gotten further along I think what I've
ended up with is actually very cool, and what started out as purely a painful
compromise might actually end up being piccolo
's "killer feature"...
A "Stackless" Interpreter Design
Some of the biggest, most interesting features of piccolo
come from its
"stackless" design, originally born only from necessity due to the limitations
of gc-arena
. This design is similar to other "stackless" interpreters, and
the one most people have heard of is
Stackless Python, so if
you're familiar with it, most of what you know will be applicable to piccolo
as well.
"Stackless" here is jargon that's used in a couple of places, not just in interpreter design. You may have heard that Rust has "stackless" coroutines, and the word "stackless" as I'm using it here means the exact same thing. It means that piccolo's Lua runtime is not "stackful", it does not rely on the Rust function call stack to maintain its execution state, and execution can be suspended at any time. This applies not just for plain interpreted Lua bytecode but also for Rust code executed from Lua (callbacks) in any form, for any depth of Lua -> Rust -> Lua calls. The overriding design decision made early in piccolo's life was that execution of Lua (and ALL callback code called from Lua) must always be able to be suspended, and that execution would be driven from the outside by polling:
// This is pseudocode to demonstrate the "stackless" or "trampoline" VM style,
// how this really works inside piccolo is slightly more complex.
// Set up the execution of some Lua code and produce an object representing the
// execution state. The internal Lua / Callback call stack is reified into this
// object, and does not rely on the normal Rust function call stack.
let execution_state = ...;
// We need to call a method to make progress, and call that method in a loop
// until the running Lua code is complete.
loop {
match execution_state.poll() {
None => {
}
Some(result) => {
break;
}
}
}
This should be extremely familiar to anyone who has ever touched Rust futures,
and the similarity is no accident. The core of the design of piccolo
is
virtually the same as Async Rust: that all long running operations are reified
into objects that must be polled to completion.
The obvious benefit for piccolo
is that it becomes trivial now to exit a call
to gc_arena::Arena::mutate
and allow for collection, since we can now do so
in-between calls to execution_state.poll()
.[12]In reality this is
piccolo::Executor::step
, but I wanted to show the similarity to normal Rust
futures.
What I didn't fully appreciate when I began writing piccolo
is that this style of writing an interpreter, though at times more difficult,
comes with many other benefits that make it (in my opinion) a worthwhile goal,
or at least a very interesting place in the design space and I hope a unique
niche that piccolo can fill.
Benefits of Stackless
Cancellation
An obvious side benefit of polling execution in the style of Future
is that,
just like a Future
, execution can be canceled at any time by just... not
continuing to call poll
.
Let's go back to the REPL from above, but this time, let's see what happens when we run some Lua code that never returns.
If you don't know much Lua, try typing something like:
while true do end
or maybe
repeat
print("Look Around You")
until false
↓ Just Endlessly Do It ∞ ↓
You should see a big interrupt button appear, and when you press it, the command should stop. How this works under the hood in this demo is that inside this webpage there is some javascript that looks something like this:
this.interval = setInterval(() => {
if (this.executor.step(8192)) {
this.finish();
}
}, 0);
This is the "poll loop" that we talked about above that polls running
Lua code to completion. This is still not exactly how it would look when
using piccolo
directly but it's a little closer... The executor
there is
a piccolo::Executor
object,[13]Well, it's a simplified wrapper for
Javascript
and Executor::step
is called in a loop until the code
has completed. Here, Lua execution actually hooks into the normal Javascript
event loop, every time the closure is run, the piccolo::Executor
is "stepped"
for 8192
"steps". The "steps" value here is referred to inside piccolo
as
"fuel" and (more or less) corresponds to a number of Lua VM instructions to run
before returning. Since the timeout given to setInterval
is 0
, we run this
function regularly and rapidly but without blocking the main Javascript event
loop. When the interrupt button is pressed, the interval is canceled and
the executor is dropped, interrupting execution. In fact, every REPL on the page
works in the same way and shares the main Javascript event loop, so all of them
can execute Lua code concurrently.
Interruptable Lua code is not something new to piccolo
, PUC-Rio Lua (and most
of its forks) have something like this in the form of
lua_sethook. This
function allows you to, among a few other things, set "hook" function that runs
every count
VM instructions, and one of the things this function can do when
run is interrupt running code by calling e.g.
lua_error.[14]If you also know that you can call
lua_yield from a hook
function and mimic what piccolo
tasklets do, I know that too, wait just a
bit and I'll talk about it.
So we can imagine a situation in which we
can set up something similar to what piccolo
is doing here, either by running
Lua code in a different thread and waiting for an interrupt
flag to be set in
the hook function, or by pumping an event loop from within the hook function or
something similar.[15]If you're internally screaming about calling
lua_yield
from a hook, wait.
However, I would argue that the way piccolo
is structured makes this
effortless and natural due to its stackless design. Since PUC-Rio Lua is written
in normal, stackful style, the best thing it can offer is a hook function that
will be periodically called by the VM loop, whereas with piccolo
the user
never loses control to the VM in the first place. piccolo
is designed such
that a call to Executor::step
should always return in a reasonable, bounded
amount of time proportional to the "fuel" it is given,[16]Ensuring
this is true in all cases so that it can be relied on as a security boundary
is complex and a WIP, but is 100% a goal of piccolo
.
so it is not
necessary to provide an equivalent to lua_hook
at all.
Pre-emptive Concurrency
One of Lua's best and most defining features is its support for coroutines. Coroutines in Lua can be used to provide seamless cooperative multitasking, and are especially powerful for things like game development where some kind of script must execute concurrently with the running simulation of a video game.
However, Lua coroutines only provide cooperative multitasking, the script must
decide where and when to yield control to the caller, and a buggy (or malicious)
script that does not yield control may need to be interrupted and canceled (via
lua_sethook
) or might make the game hang.
Rust (at least, unstable Rust) also has coroutines, and they are used behind
the scenes to implement async. In Rust, like in Lua, these coroutines provide
cooperative multitasking, Rust code must decide when to call await
,
or an implementation of Future::poll
must decide when to return. A buggy
implementation of Future
will hang an async executor thread just like a buggy
Lua coroutine might hang a game loop.
In piccolo
, running Lua code acts very similarly to a Rust "task" (a term for
something that implements Future
and is run on an async "executor"), and like
Rust tasks, they can easily be run concurrently. However, piccolo
works very
hard to guarantee that piccolo::Executor::step
returns in a bounded amount
of time, even without the cooperation of the running Lua code. So, by using
several independent piccolo::Executor
"tasklets" and multiplexing calls to
each piccolo::Executor::step
, we can give Lua pre-emptive multitasking.
It's easier to understand with a demonstration. The two REPLs below are
connected to one Lua instance. Instead of a single print
function, they have
the functions print_left
and print_right
to print in the left or right REPL
console. They share global variables, so we can use this to demonstrate that the
two interfaces are running Lua code on the same VM.
In the left REPL, type something like this
i = 0
while true do
i = i + 1
end
While that is running in the left REPL, in the right REPL type this:
while true do
print_right(i)
end
↓ These two REPLs are connected to the same Lua
instance! ↓
You should notice that it appears that two separate Lua REPLs access the same
state, seemingly in parallel! In fact, if you copied the exact code above, the
right REPL probably prints values of i
seemingly at random, every thousand and
a half or so increments.
In this demo, this behavior comes from the way that running Lua code is run
inside setInterval
callbacks... The REPLs here work exactly the same as any
of the REPLs above except that they both share a Lua instance, and this
really is the only difference. There are two setInterval
callbacks calling
Executor::step
being run by the browser at the same time and each callback
is run in a round-robin fashion.[17]I actually don't know how the
task scheduler in browsers works exactly, but I think it will execute
both REPLs in a simple round-robin way?
In a plain Rust environment
you could get the same behavior by looping and calling Executor::step
for
one executor then another in turn, in a simple round-robin scheduler. This is
very similar in a way to OS threads which also are pre-emptively scheduled, but
instead of using an OS scheduler, we write our own scheduler and execute some
running Lua for a time slice via calls to Executor::step
.[18]This
idea is not new, and other stackless interpreters have called this scheduling
idea tasklets.
In fact, though you can't observe actual data races here or even an
analogue of it within Lua, you can observe a mirror of other race condition
problems that OS concurrency primitives are meant to solve, and a custom
scheduler for these Lua "tasklets" might even want to provide a version of
common OS primitives to prevent them and aid scheduling.[19]Stackless
Python has a custom version of
channels to
communicate between tasklets that serves this purpose, but I don't actually know
whether these channels affect tasklet scheduling (but they could!).
This is also again, VERY similar to how rust Futures (well, tasks) work when
running on an async executor. Async tasks can be round robin scheduled or in
some more advanced way, but each task has a "slice" of execution given to
it by calling its Future::poll
method. The difference in piccolo
is that
Lua scripts are not actually aware that they are being scheduled this way,
and from the perspective of Lua, this scheduling happens pre-emptively. Rust
callbacks in piccolo
are more privileged than this and actually receive a
piccolo::Fuel
object that they can use to consume fuel proportional to their
work, and must be trusted to cooperatively schedule themselves (you can always
incorrectly write loop {}
in a callback, after all), but Lua code cannot
break out, at least not on its own.
Yet another way to look at this is that piccolo
executes Lua code sort of
as if there are something like invisible coroutine.yield
statements inserted
everywhere, but ones that operate on a different level of abstraction from the
real coroutine.yield
, ones which regular Lua code cannot interact with.
Let's imagine we transformed the above code that I asked you to type in the paired REPLs into something like this in plain Lua:
-- This example is in a big function loaded in the REPL below so you can easily
-- call it, and I'll do the same thing with later examples. I'm not gonna ask
-- you to type *that* much.
function coroutine_example()
local i = 0
local co1 = coroutine.create(function()
while true do
i = i + 1
coroutine.yield()
end
end)
local co2 = coroutine.create(function()
while true do
print(i)
coroutine.yield()
end
end)
while true do
coroutine.resume(co1)
coroutine.resume(co2)
end
end
↓ The above code is loaded here you want to run it ↓
This behaves sort of like the code above but in a much more predictable way.
If you know Lua and are comfortable with coroutines, you can probably tell that
the above code is pretty much just a convoluted version of a simple for loop,
but it's enough to demonstrate the idea. We have two coroutines that execute
their own loops independent of each other and we schedule between them, but this
time we require that the body of the coroutines decide where to yield to the
scheduler to allow the other task to run. Stackless execution in piccolo
is
almost the same as if we could painlessly automatically insert these calls
to coroutine.yield
everywhere in the body of our running Lua tasks and use
this to pre-emptively rather than cooperatively schedule them.
Fuel, Pacing, and Custom Scheduling
In the last section where I transformed the code executed in the concurrent
REPLs into basic Lua coroutines, you may have noticed a big difference between
the two. In the first example, the scheduling between the two Lua REPLs was
somewhat random and hard to discern, the left REPL would increment the i
value
more than a thousand times for every time the right REPL printed the value of
i
, but in the second example the first task would increment the i
value once
for every time the second task printed i
. The reason for this has to do with
how the javascript for this page is actually written, but it's a perfect, simple
example of something using piccolo
enables: custom tasklet scheduling.
I've mentioned "fuel" before in the context of piccolo::Executor::step
. Here
is the real signature of Executor::step
inside piccolo
:
impl Executor {
Runs the VM for a period of time controlled by the `fuel` parameter.
The VM and callbacks will consume fuel as they run, and `Executor::step`
will return as soon as `Fuel::can_continue()` returns false *and some
minimal positive progress has been made*.
Returns `false` if the method has exhausted its fuel, but there is
more work to do, and returns `true` if no more progress can be made.
pub fn step(self, ctx: Context<'gc>, fuel: &mut Fuel) -> bool {
...
}
}
The method requires a fuel: &mut Fuel
parameter to, well, "fuel" the VM, and
the running VM consumes this fuel as it runs. Fuel
is a very simple wrapper
around an i32
value (you can see the current implementation
here),
that is decremented by Executor
as it runs Lua code, and also optionally by
any Rust callback that it calls. piccolo
's ultimate goal is to enable
treating all loaded Lua code as potentially malicious, but Rust callbacks
are never on the other side of this security boundary. Callbacks are meant
to cooperate with the execution system of piccolo
and act as part of the
security boundary to potentially malicious Lua, and as such, they can consume
Fuel
or even "interrupt" the "flow" of fuel to the Executor
that calls them.
This system makes a lot of sense to provide, and not only strictly for
security. piccolo
's goal is to enable Lua tasks to run concurrently not
only with Rust but with each other, and as such there are many ways we we might
want to give certain tasks more or less time to run. We could imagine a game
engine where we want to provide a sandbox for running Lua code such that no
matter what, if the script is badly written or buggy, that the game simulation
can continue without being compromised. Tasks could be scheduled such that
they are assigned a certain amount of fuel "per second" up to a predefined
"tank limit", giving them a kind of "burst" fuel. In this way, a task that
periodically needs a lot of computational power can get it, but a broken task
that has infinite looped will always use a much smaller amount of sustainable
fuel per frame.[20]You could get funky with this too, make game
entities that have scripts attached that always use the maximum fuel get
hot in game and make that have a gameplay effect. This might only be fun in
something like ComputerCraft... or maybe it's not fun at all and you shouldn't
listen to me about gamedev... probably the second one.
Besides just "consuming" fuel, another thing a Rust callback can do is
interrupt fuel. This is quite similar in behavior to just consuming all of
the remaining fuel so the difference isn't that important, but it exists to mesh
well with the use case described before, where we want to give tasks a
sizeable "reserve tank" of fuel. "Interrupting" fuel flow makes the outer
Executor::step
immediately return to the Rust code calling it, no matter the
amount of fuel currently consumed. This is mostly useful for technical purposes,
for example if one Lua task is waiting on some event and cannot possibly
currently make any progress, or if some callback must return to the outer Rust
caller immediately to take effect.
This is what is happening on REPLs on this page when print
is called! I
noticed when testing the REPL I was writing that calling print
in a hot loop
slowed down the whole page quite a lot, and mostly just to create and destroy
a bunch of output div
s faster than they could be read as the output went far
past the console scrollback limit. So, to fix this, I made print
callbacks
in this page always call Fuel::interrupt
to make the page more responsive
during hot loops. When you call print
in a REPL on this page, it immediately
yields control to the browser task queue! This is the sort of thing that having
deep control over VM scheduling allows you to do: customize it to make it
work in many different situations.[21]Yes I know you can also use
lua_yield
in a callback for the same effect, but crucially, that means you
cannot mix these callbacks with normal Lua coroutines. I'm going to talk more
about this before the end, I promise.
"Symmetric" Coroutines and coroutine.yieldto
It's tough to talk about coroutines because there tend to not be universally agreed upon definitions, but I'm going to try. You might want to have the wikipedia article on coroutines and possibly this paper open if you want the full extra credit for this section.
Lua has what is usually referred to as "asymmetric coroutines", which are (as far as I can tell) the most commonly seen type of coroutine. This is also the same sort of coroutine that Rust supports with the (unstable) std::ops::Coroutine trait. As such, this can feel like a fancy term for a simple idea, but it refers to the limitation that coroutines yield only to their caller. It is also possible to instead support fully "symmetric" coroutines that can yield to any other coroutine, not just the calling one!
This is probably easier to understand expressed as a Lua function that almost provides what we want. Symmetric coroutines work as if we had the following function in Lua:
-- We want to yeild control to another coroutine directly, so we need to yield
-- control *and* resume another coroutine somehow at the same time.
function yieldto(co)
Unfortunately this doesn't quite work as is, because there is no such
thing as a "tail yield" in standard Lua. This will resume the given
coroutine, but it will keep around a stack frame for *this* coroutine
until the entire sequence of coroutines eventually returns, which may
be *never*. As it is, this "works" but it is a stack leak.
coroutine.yield(coroutine.resume(co))
end
We want a function that can yield to another coroutine by resuming that
coroutine and yielding to the caller... whatever that other coroutine would
have yielded. The problem is that this function as written is a stack leak:
there is no way for normal Lua to "tail yield" like it can "tail return", the
yieldto
function as written will consume stack space for the current call to
coroutine.resume
, only giving the stack space up when the stack of coroutines
eventually finishes. True symmetric coroutines do not have this limitation, and
can mutually yield to each other without bound.
Because of the way piccolo
's Executor
works, Lua control flow that might
normally be expressed as a Rust function call (such as a callback resuming
another coroutine) is reified into the structure of the Executor
, and the
actual Rust control flow always "jumps back up" to Executor::step
. This is
actually the origin of the term "trampoline style" when referring to stackless
interpreters, that control flow always "jumps back" to the same place. In
PUC-Rio Lua, coroutine.resume
is a normal C function call, so it is impossible
to directly support this "tail yield" operation and avoid the stack leak, but
piccolo
's design just so happens to allow easily providing this as a builtin
function: coroutine.yieldto
, enabling full symmetric coroutines!
Let's see how it works...
-- I'm deeply sorry for this example...
function browse_internet()
local be_bored
local be_optimistic
local be_dooming
be_bored = coroutine.create(function()
while true do
print("I'm bored, I think I'll mindlessly browse The Internet")
coroutine.yieldto(be_optimistic)
end
end)
be_optimistic = coroutine.create(function()
while true do
print("Maybe The Internet won't be so bad this time")
coroutine.yieldto(be_dooming)
end
end)
be_dooming = coroutine.create(function()
while true do
print("I think I need a break from The Internet")
coroutine.yieldto(be_bored)
end
end)
coroutine.resume(be_bored)
end
↓ You can run the 100% accurate Internet Simulator below ↓
Now, unless you're already pretty familiar with coroutines (or for some
unearthly reason you decided to stop reading this and instead go carefully read
the paper I linked earlier),
you might not know that "symmetric" and "asymmetric" coroutines are actually of
equivalent expressive power. Let's pretend that we don't have
coroutine.yieldto
and transform the previous example a bit to make up for it.
-- I'm still sorry...
function browse_internet()
Because we don't have proper `coroutine.yieldto`, we need some way of
returning to the outer level and resuming the next coroutine. We can't
provide this as a function because there's no way around the stack
leak, but we can provide it as an outer "runner".
function run_coroutines(co, ...)
local _, next = coroutine.resume(co, ...)
if not next then
return
end
return run_coroutines(next)
end
Afterwards, we change every call to `coroutine.yieldto` to
`coroutine.yield`, and wrap the coroutines in our "runner".
local be_bored
local be_optimistic
local be_dooming
be_bored = coroutine.create(function()
while true do
print("I'm bored, I think I'll mindlessly browse The Internet")
coroutine.yield(be_optimistic)
end
end)
be_optimistic = coroutine.create(function()
while true do
print("Maybe The Internet won't be so bad this time")
coroutine.yield(be_dooming)
end
end)
be_dooming = coroutine.create(function()
while true do
print("I think I need a break from The Internet")
coroutine.yield(be_bored)
end
end)
run_coroutines(be_bored)
end
↓ After the above transform, our simulator is still 100% Accurate ↓
So, the coroutine.yieldto
function that piccolo
provides doesn't actually
make Lua fundamentally any more powerful, instead it is more of a convenience.
So why bring this up? Well, besides it being a very neat function to be able
to provide, and piccolo
being able to provide it in a way that doesn't
require any outside "runner", I wanted to bring attention to the idea of
transforming code like this.
It's no coincidence that piccolo
has an easy time providing
coroutine.yieldto
. The above transform takes normal control flow and turns
it into control flow via return values. This is very nearly the exact same
transform that has already been done by piccolo
's stackless design that I've
been talking about this whole time.
In fact, let's look at the actual implementation of coroutine.yieldto
in the code for
the coroutine
lib inside piccolo
:
coroutine.set(ctx, "yieldto", Callback::from_fn(&ctx, |ctx, _, mut stack| {
let thread: Thread = stack.from_front(ctx)?;
Ok(CallbackReturn::Yield {
to_thread: Some(thread),
then: None,
})
})).unwrap();
Ignoring some of the unimportant details, we see that the 'yieldto'
field is
set to a callback function, and that callback function takes a single argument
of a Thread
. Then, it returns an enum value CallbackReturn::Yield
and states
which thread to yield to (the normal coroutine.yield
function simply sets
to_thread
to None
instead). This is exactly the same as the transform that
we've already done above, which shows why this is so simple for piccolo
to
provide: piccolo::Executor
already works like this.
The "Big Lie"
So far I have talked a lot about piccolo
's unique design, and how it allows
piccolo
to have powers that other Lua interpreters can't have. I have been
lying to you! The actual truth is rather complicated, and you need the context
of everything I've said so far to fully understand it.
The real truth is... PUC-Rio Lua can already sort of do about 70% of the same
things piccolo
can do. In fact, piccolo
is not uniquely designed at all,
it is the natural conclusion to the way PUC-Rio Lua already works.
Let's start by doing something that I think almost nobody who uses PUC-Rio Lua or Luau or LuaJIT knows that they can do.[22]LuaJIT is slightly more complicated because you probably have to disable the JIT to make it work in all cases. We're going to implement tasklets using the plain Lua C API!
I don't have the energy to get normal PUC-Rio Lua 5.4 working in a browser with Emscripten, so you won't be able to run these examples interactively, you'll just have to trust me (or set up a C build environment with Lua 5.4 and try them yourself). You'll also have to understand C and the PUC-Rio Lua C API to fully understand these examples, but hopefully I can comment them enough to show what's going on even if you don't.
#include <assert.h>
#include <stdbool.h>
#include "lauxlib.h"
#include "lua.h"
#include "lualib.h"
// We will set up a "hook" function for the Lua VM to periodically call.
//
// In this case, the "hook" function always *yields*, which will only work if
// the calling Lua thread is itself a coroutine (and not the main thread).
//
// We are sort of using the "hook" function to *externally insert* calls to
// `coroutine.yield` periodically in otherwise unmodified Lua.
void yield_hook(lua_State* L, lua_Debug* _ar) {
lua_yield(L, 0);
}
int main(int _argc, char** _argv) {
Open the main Lua state and all of the Lua stdlib.
lua_State* L = luaL_newstate();
luaL_openlibs(L);
Create a thread separate from the main one to use as our coroutine
thread.
lua_State* co = lua_newthread(L);
Load *unmodified* Lua code, no manual calls to `coroutine.yield` are
necessary here.
assert(
luaL_loadstring(
co,
"while true do\n"
" print('hello')\n"
"end\n"
)
== LUA_OK
);
Set our hook function on the coroutine thread, *forcing* the coroutine to
yield whenever the hook is called.
In this case, the hook will be called every 256 VM instructions.
lua_sethook(co, yield_hook, LUA_MASKCOUNT, 256);
Our main loop.
Every time through the loop, we resume our coroutine. The hook function
*externally* causes the called Lua code to periodically yield without
having to modify our Lua source code to manually add `coroutine.yield`
statements.
When running this C code with my current version of Lua 5.4, I see 64
"hello" lines for every 1 "there" line, showing that execution correctly
periodically returns to C.
while (true) {
int nresults;
assert(lua_resume(co, NULL, 0, &nresults) == LUA_YIELD);
lua_pop(co, nresults);
printf("there\n");
}
}
The example above shows a fully working, if simplistic, Lua tasklet system.
In the same way that piccolo
's Executor::step
function works "as though"
there are invisible periodic calls to coroutine.yield
everywhere, calling
lua_yield
from a lua_Hook
function also (and much more literally) inserts
invisible periodic calls to coroutine.yield
. This is more or less everything
required for a tasklet!
PUC-Rio Lua can do about 70% of what piccolo
can do, right out of the box! The
problem is the last 30%. Let's modify the example above very slightly...
#include <assert.h>
#include <stdbool.h>
#include "lauxlib.h"
#include "lua.h"
#include "lualib.h"
// Same as last time, we effectively insert invisible periodic calls to
// `coroutine.yield`.
void yield_hook(lua_State* L, lua_Debug* _ar) {
lua_yield(L, 0);
}
int main(int _argc, char** _argv) {
Open the main Lua state and all of the Lua stdlib.
lua_State* L = luaL_newstate();
luaL_openlibs(L);
Create a thread separate from the main one to use as our coroutine
thread.
lua_State* co = lua_newthread(L);
Still *unmodified* Lua code with no manual calls to `coroutine.yield`.
We make one small change though, before calling `print('hello')`, we call
`table.sort` to sort a Lua table. The callback isn't important here, but
what's important is that `table.sort` is a C function which calls a Lua
function (the comparator).
We put a big for loop in the comparator function just to make sure the VM
spends some actual time here, but no matter what, the same behavior will
eventually occur if you use Lua -> C -> Lua callbacks at all.
assert(
luaL_loadstring(
co,
"while true do\n"
" table.sort({3, 2, 1}, function(a, b)\n"
" for _ = 1,1000000 do end\n"
" return a < b\n"
" end)\n"
" print('hello')\n"
"end\n"
)
== LUA_OK
);
Set our hook function just like last time to execute every 256 VM
instructions.
lua_sethook(co, yield_hook, LUA_MASKCOUNT, 256);
Main loop, unmodified from the previous example.
while (true) {
int nresults;
assert(lua_resume(co, NULL, 0, &nresults) == LUA_YIELD);
lua_pop(co, nresults);
printf("there\n");
}
}
If you try and run this C code, it will immediately error on this assert
in the main loop:
assert(lua_resume(co, NULL, 0, &nresults) == LUA_YIELD);
The reason for this is that lua_resume
is erroring and returns LUA_ERRRUN
instead of LUA_YIELD
. This is happening because lua_yield
, which is
being called from our "hook" function, cannot be called from within most
C callbacks. What is the C callback? It's our call to the stdlib function
table.sort
within the body of the tasklet loop. At the time that the call to
lua_yield
is called incorrectly, the C call stack looks something like this
(simplified):
main
-> lua_resume
-> luaV_execute
(the main VM loop) -> sort
(in
ltablib.c) -> lua_call
-> luaV_execute
(the main VM loop again, for the
comparator) -> yield_hook
-> lua_yield
PUC-Rio Lua uses the normal C stack for much of its internal state, and calls
to lua_yield
are expressed as a C
longjmp, jumping back up to
an upper C frame and popping any call frames in-between from the call stack.
So, certain operations are simply disallowed when the inner call to longjmp
would destroy essential information about the runtime state.
There IS a way around this problem, however. Ultimately, the problem is that
the call to table.sort
, a C function, in turn calls a Lua function with the C
API function lua_call
. Any Lua function called this way is disallowed from
calling coroutine.yield
(or its C equivalent lua_yield
). PUC-Rio's C API
provides a special version of lua_call
to get around this:
lua_callk. You can read
in more detail about the entire situation in the section of the PUC-Rio Lua 5.4
manual called
Handling Yields in C.
This does work, and in this way, PUC-Rio Lua provides the ability to yield from
situations like this Lua -> C -> Lua sandwich. However, table.sort
is not
written this way, and in fact almost none of the stdlib is written this way at
all![23]The sole exception I am aware of is
pcall.
The reason for this is, frankly, that transforming C code to work this way
is enormously difficult. The C code in question must be able to handle a
longjmp
, when the inner Lua code triggering a longjmp
will destroy
(not even unwind!) the current C stack up to where it was called, and the only
way for the C code to resume is through the lua_KFunction
and lua_KContext
passed to lua_callk
. There are no Drop
impls to rely on, no automatic memory
management, no coroutines, the C code must be transformed so that it relies
entirely on a type pointed to by a lua_KContext
for its state, so that it can
be suspended at any time.[24]This should sound familiar.
This is not the only problem, either. By repurposing normal Lua coroutine yields
like this to yield back to C, you take away Lua coroutines from the usable part
of the Lua language. If we were to try to use normal coroutines in our tasklet
system, the inner lua_yield
from the hook function would just yield to the
nearest thing that has called lua_resume
, which in this case would be the
Lua thread which called coroutine.resume
and not the top-level C code. I
love coroutines,[25]As much or more than I love REPLs!
and
Lua without coroutines is frankly no longer really Lua, but with enough effort,
I think you could get around this problem too! Remember the transform we did
before, where we made symmetric coroutines out of asymmetric ones? You can do
something similar with the normal Lua C API but wrapping coroutine.yield
in
a special version that instead returned whether or not it is a real yield or
a synthetic one from the lua_Hook
function. You would have to go further than
this to make it work, restructuring all of the other coroutine
methods so that
which thread was waiting on the results of which other thread was kept in an
array rather than the C stack, so that the coroutine "tasklet" system continues
to work while providing a similar, inner system for "normal" Lua coroutines.
You could also do the work of re-implementing every function in the stdlib
that calls back into Lua[26]There actually aren't that many, but
it's deceptive how many there are because much of the time it happens due to
triggered metamethods.
in such a way that it used lua_callk
with a
continuation function instead of lua_call
, too, so that every C function in
the stdlib became suspendable. For good measure, you could also periodically
yield long running callbacks even if they didn't call back into Lua, just to
make sure that execution always jumped back out to the outermost C code in a
bounded amount of time.
So lets summarize this list of theoretical changes we can make to PUC-Rio Lua to make a complete tasklet system.[27]The "last 30%" is pretty generous. In reality, the last 30% makes this "vanilla" tasklet system pretty useless in a huge number of cases. It's not that 30% of use cases are non-viable, it's that 30% of the surface area of Lua no longer works, so without further changes many use cases would be non-viable.
- Use
lua_Hook
functions to insert synthetic calls tolua_yield
within all Lua code. - Make all of the stdlib that calls Lua functions (including those that
implicitly call metamethods) suspendable by using
lua_callk
and continuation functions. - Reimplement the Lua
coroutine
library making it one level of abstraction up from normal calls tolua_yield
, so that normal Lua coroutines can still work. We would need to implementcoroutine.resume
in a different way that does not use the C stack. We can do a transform similar to implementing "symmetric" coroutines over "asymmetric" ones here, where we implement "Lua" coroutines over our lower level "synthetic" yielding. Lua calls tocoroutine.yield
andcoroutine.resume
would now both be a yield to the calling C code, and the yielded values would tell the outer C code what to do next (whether to resume another coroutine or yield to whatever the "upper" coroutine was). As a side effect,coroutine.yieldto
becomes easy to implement. - For good measure, keep track of some unit of time cost in all callbacks, and
insert calls to
lua_yieldk
in all long running callbacks so we know that control will always return to the outer calling C code in a reasonable amount of time.
We have now reached, very roughly, the current design of piccolo
.
Rust Coroutines, Lua Coroutines, and Snarfing
In the previous section I laid out a rough explanation of how to transform
PUC-Rio Lua as it exists today and build a system similar to what piccolo
forces by design. However, I am not aware of anyone ever doing anything like
this on a grand scale.[28]I know people do tasklet systems in PUC-Rio
Lua and Luau, but I think they limit the tasklet code to very simple Lua that
doesn't require completely rewriting the Lua stdlib.
The reason for
this, I think, is simple, and that is that it is just monumentally hard to
write C callbacks this way!
The same problem exists in piccolo
though, which I alluded to near the
beginning of this post. In piccolo
, long running callbacks are represented by
a trait called
Sequence
which allows them to be suspended. More precisely, it is not so much that they
are suspended as it is that their API must mirror the outer Executor
API
in piccolo: they must be polled to completion. Now, the situation is not
nearly as bad here as trying to use lua_callk
/ lua_pcallk
/ lua_yieldk
in plain C, but fundamentally it can still be more than a little painful.
The Sequence
trait shares a lot in common with the Future
trait, in that
both represent an operation that must be polled to completion. Like I said
before when I was introducing the "stackless" design, this similarity is no
accident.
I used the slang word "snarf" casually near the beginning of this post without really explaining it. As I understand it, snarfing is something from PLT jargon where if you implement an inner programming language B in an outer programming language A, features from language A can be very easily and automatically incorporated into language B. The most common example I see here is actually garbage collection, if you implement a runtime for a garbage collected language within another garbage collected language, and you're okay with the GC semantics from the outer language being reflected in the inner language, then you can snarf garbage collection from the outer language. Think of implementing Lua in something like Go, even though the specifics of the GC semantics in Lua may not be expressible in Go,[29]I don't actually know whether Go can express all of the minutia of Lua weak / ephemeron tables and finalization. it would probably be worth it to just snarf garbage collection from Go and use plain Go pointers as Lua references.
Snarfing can also be simpler things like implementing the stdlib of the inner language using the stdlib of the outer language, in PUC-Rio Lua, there is actually a good deal of functionality snarfed from C, most of it bad (like os.setlocale).
With all of this context finally out of the way, I can say what I've wanted to
say almost from the beginning of this very long blog post: The original design
I wanted with piccolo
and gc-arena
was for Lua to snarf coroutines from
Rust. I'm going to talk about this in more detail in a future post because this
post is so long already, but Sequence
's similarity to Future
is because I
want to use Rust coroutines to implement piccolo
.
Think about it... why is PUC-Rio Lua's C interpreter written the way it is?
Why do lua_callk
and lua_pcallk
and lua_yieldk
exist at all... they exist
because C does not have coroutines. This entire post I have been dancing
around this issue without addressing it because I feared it wouldn't make
sense without a mountain of context, but the entire time I've been talking
about "reifing state" that would "normally be held inside the call stack" into
objects that can be "polled to completion"... that is the very core of what a
coroutine is.
The only real downside to gc-arena
and piccolo
is having to do this
transform manually rather than letting the Rust compiler do it. The pain of
using gc-arena
and piccolo
is THE SAME pain that existed before Rust Async
was stabilized, with Future combinator libraries having to fill the gap. In
fact, an older version of gc-arena
tried to provide combinators like this to
try and fill the gap, but making it fit into piccolo
in a generic way was just
too painful and the combinator library was dropped. piccolo::Sequence
actually
comes from the remains of this combinator library.
And all of this exists solely because I can't figure out how to make a Rust
coroutine implement gc_arena::Collect
.[30]If I sound agitated,
it's because I spent a large amount of my life force trying to make this work
somehow. It needs Rust compiler changes.
If I could figure this out,
all of the problems with gc-arena
and piccolo
could melt away, and the Rust
compiler could do the painful transformation into "stackless" interpreter design
largely for us. Even the term "stackless" is shared with Rust coroutines.
Zooming Out
I'm gonna spend a bit of time here zooming out some more. Hopefully I won't zoom out so far that I stop even being anchored to reality.
I think Rust's really cool core idea is the same that is shared by all systems programming languages: that they are meant to be the last stop in a line of other choices. I honestly don't think every single bit of software needs to be written in Rust or any systems programming language. To me, systems programming languages are languages where if you need to make system A work with system B, no matter what those systems are, you can use them. Rust and C are languages that you're supposed to use when what you're making needs to fit almost anywhere. They're supposed to be the languages with the fewest possible assumptions about how the rest of the world works, becasue they're meant to be a host or glue language to inner systems with better assumptions, which are more fit to purpose.
I know that this perspective on systems programming languages is not universal, and that the real world is actually quite resistent to putting things into neat little boxes like this, but I think this perspective is at least a useful one.
As such, I always flinch a little when I see people trying to write systems in Rust as though they're trying to figure out the one solution for something, assuming that no other solution would EVER need to exist within the same program, or even need to exist at all. I think one size fits all solutions to problems are not where Rust's strength is. Global async reactors / executors, whole-program garbage collectors,[31]Anything with global variables, really. Hating on global variables might sound kinda 90s, but that doesn't make it wrong. heck even whole program allocators,[32]Maybe Zig has some good ideas here? all of these things always make me some amount of uncomfortable because I just think that.. systems programming languages are meant for making BIG end products or libraries that last a long time, where more than one of these kinds of systems might need to exist at once, or you may need to take them apart and use them a la carte. It's not that I think these are wrong to use or wrong to make, I just don't prefer to use those kinds of solutions myself if I can avoid them because also honestly the user of my library knows better than me. There's a tradeoff in programming between flexibility and fitness for purpose, systems programming is the way it is because it's supposed to be ultimately flexible, it's what you use when a more friendly, more tailored tool isn't flexible enough. I don't like going against that idea when writing code that I want to last for a long time.[33]I also understand that compromises have to be made sometimes, and usability matters, so I genuinely mean no offense to libraries that might choose different priorities, but it might not be what I personally want.
One of my favorite parts of pretty much every version of Lua is how painless it is to have multiple copies of the interpreter. If you've ever run into large garbage collector pauses in other languages, this rarely happens in Lua not because its garbage collector is just that good, but because you aren't forced to have just ONE of them in your program, you can have as many of them as you need, each in isolation from each other! Lua is a language that actually meshes very well with my vague ideas about the core idea of systems programming, because PUC-Rio Lua was written to fit almost anywhere. It's actually amazing how neatly it fits into C, how it is fine being the bottom of the hierarchy, it's just a C library and you just call C functions. Two distant parts of a program both use Lua? Different versions of Lua with different loaded libraries? You can make it work! It doesn't read external files unless you tell it to, it doesn't do anything unless you tell it to because it makes very few assumptions. It's your tool, and it fits neatly into whatever program you already have. I think this is why it has remained so popular for so many years. [34]And so popular to integrate into big, complex systems programming projects like video games.
I want to make a version of Lua that feels for Rust like PUC-Rio feels for C,
but to go even further. I want to make a version of Lua that fits anywhere as
much as I possibly can make it. Untrusted input! Cancellation! I want to make
a version of Lua with the fewest possible opinions about how the rest of the
program is structured. I know piccolo
is a little far from that in several
ways right now, but that's my ultimate goal. I think stackless interpreters
actually fit this idea of being as unobtrusive as possible, of fitting
anywhere better than classical language interpreters.
Garbage collection systems in general are very often at odds with this idea of fitting anywhere. There can only be one boehm gc in a C program, after all. It's interesting to me that garbage collector systems as a library for C have to be so much slower than garbage collectors written for other languages but written in C. The problem is not that C can't write fast garbage collection systems, the problem is that the C language itself has so few "global assumptions" built into it. It's much easier to write a garbage collector for a language that can annotate where GC pointers are on the language's call stack or in heap allocated objects than one like C, where it is extremely difficult to have such things.
Progress in systems programming languages seems to happen when new abstractions are invented that give new fundamental powers but do NOT introduce more required assumptions. I think coroutines are one of these, and that all systems programming languages should have a stackless coroutine system because it is the sort of thing that can fit into other systems as much as possible. I think there is also some kind of deep connection between higher level languages whose compilers / interpreters do things like annotate where garbage collected pointers are stored in the language's call stack or automatically insert garbage collector safe points, and the idea of coroutines as a general reification of the call stack itself, letting the language do this manipulation rather than a specialized compiler.
I came up with this connection way back in early 2019, but if we could make
Rust coroutines implement Collect
, then this makes yield / await into an
abstraction of the garbage collector safe point. When a Future
or Coroutine
yields control to the caller, all of the (apparent) stack variables are
guaranteed to be stored inside the state of the running coroutine. This would
allow gc-arena
to easily separate collection and mutation in normal, straight
line Rust code that is simply annotated with await
s (or yield
s) to mark
where garbage collection can safely take place in the same way that a higher
level language runtime inserts safe points to mark where garbage collection
can safely take place.[35]I feel like I'm doing a math proof but I
can't really figure out a direct line between A and B but I know they're the
same, so I go from A and get as far as I can towards B, and I go from B and get
as far as I can towards A, then I wave my arms really wildly up and down so
everyone gets it.
I think Rust is so close to having some very interesting, novel powers with
its coroutines by simply being able to combine existing features together.
I can automatically serialize a custom struct with #[derive(Serialize)]
, and
I can automatically transform a function body into a state machine, but
what I cannot do is #[derive(Serialize)]
this state machine, nor can I
#[derive(Collect)]
it. Why not??
This deserves its own blog post, but I felt like I couldn't rightly close
out this post without at least mentioning it. In my next post I'm going to
explore this idea more fully and hopefully try and actually make it "work" by
pretending to be able to #[derive(Collect)]
a coroutine. I think that Rust
might need more than just this feature to make this system workable, but if
it did work, it could represent a largely painless, general, isolated system
for tracing garbage collection in Rust. A garbage collection system that can
fit anywhere.
Bye!