Bringing MultiQueue to the Nanos Unikernel Network Stack
We've made substantial changes to our networking stack, including numerous changes to our LWIP fork (which we're not even sure we can still call LWIP, given how much it has diverged).
The latest change adds support for multiple transmit and receive (tx/rx) queues. In short, this gives much better network performance when you have an instance with multiple vcpus.
Modern nics, and by modern I mean ones found in commodity servers for the past 10-15 years (you almost certainly have one in your system), have multiple rx/tx queues, which lets multiple threads send and receive traffic in parallel. Without multiple queues, only one thread can process incoming and outgoing network traffic. Generally speaking you don't want more queues than the number of cores you have available.
So essentially the single queue model looks something like this:
        _______
        | nic |
        -------
  _______    _______
  | /|\ |    |  rx |
  |  |  |    |  |  |
  |  tx |    | \|/ |
  -------    -------
        ________
        | cpu1 |
        --------
and we made something that looks like this instead:
               _______
               | nic |
               -------
  _______    _______      _______    _______
  | /|\ |    |  rx |      | /|\ |    |  rx |
  |  |  |    |  |  |      |  |  |    |  |  |
  |  tx |    | \|/ |      |  tx |    | \|/ |
  -------    -------      -------    -------
       ________                ________
       | cpu1 |                | cpu2 |
       --------                --------
Just to be clear, Linux and other systems have had this for years, so this is more of a "Nanos is catching up" type of feature, but one that is important for scaling nonetheless. Again, if you are running on a t2.small it doesn't really matter.
I keep telling people that you can't just wave a magic wand and get superior performance. This is a prime example of the type of work that systems engineering has to do. It is work that doesn't really have anything to do with unikernels per se, but it's something you'll notice when comparing a high core count Linux machine with your unikernel instance and wondering why the former might be getting better throughput out of the same webserver.
So what is a queue (or, as it is more commonly called in the context of network interfaces, a ring) anyway?
One Ring to Rule Them All, One Ring to Find Them, One Ring to Bring Them All
Have you ever had to make a queue out of two stacks, or a stack out of two queues, on a whiteboard? You might remember that by building a queue from two stacks you are inherently choosing to make either enqueue or dequeue costly. You might be more familiar with the simple linked list implementation. Like everything in software, there are many approaches. Well, there is another method: the ring buffer, which is what these queues are.
One of the differences between a linked list implementation and a ring buffer for a queue is space: the ring buffer has a fixed size. The tradeoff is that when the ring is full you have to choose between having the enqueue fail (e.g. dropping packets, in this case) or overwriting old data.
On the flip side, ring buffers are faster, especially when you know how much data you are going to want to store at a given time.
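To make that concrete, here is a minimal fixed-size ring buffer in C. This is just an illustration of the fail-on-full (i.e. drop-the-packet) policy, not the actual virtio descriptor ring or any Nanos code:

#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256   /* fixed capacity chosen up front */

struct ring {
    void *slots[RING_SIZE];
    size_t head;    /* next slot to dequeue from */
    size_t tail;    /* next slot to enqueue into */
    size_t count;   /* number of occupied slots */
};

/* Enqueue fails when the ring is full -- the "drop the packet" choice. */
bool ring_enqueue(struct ring *r, void *pkt)
{
    if (r->count == RING_SIZE)
        return false;                      /* full: caller drops the packet */
    r->slots[r->tail] = pkt;
    r->tail = (r->tail + 1) % RING_SIZE;
    r->count++;
    return true;
}

void *ring_dequeue(struct ring *r)
{
    if (r->count == 0)
        return NULL;                       /* empty */
    void *pkt = r->slots[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    r->count--;
    return pkt;
}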
Our Implementation
By default, the virtio-net driver uses as many queues as supported by the attached device. It is possible to override this behavior by specifying the "io-queues" configuration option in the manifest tuple corresponding to a given network interface. For example, the following snippet of an ops configuration file instructs the driver to use 2 queues for the first network interface:
"ManifestPassthrough": {
"en1": {
"io-queues": "2"
}
}
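Putting it together, assuming a config file named config.json and a placeholder program called myserver (both names are just examples), running it with ops would look something like:

{
  "ManifestPassthrough": {
    "en1": {
      "io-queues": "2"
    }
  }
}

ops run myserver -c config.json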
Note: if you are testing locally with something like iperf, you'll want to ensure you have vhost enabled. (You may need to 'modprobe vhost_net'.) Vhost provides lower latency and much greater throughput. Why? It moves packets between guest and host in the host kernel, bypassing qemu.
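A quick way to check on the host that the module is loaded before benchmarking:

sudo modprobe vhost_net
lsmod | grep vhost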
The number of queues used by the driver is always limited to the number of CPUs in the running instance (this behavior cannot be overridden by the "io-queues" option).
For optimization, each tx/rx queue is configured with an interrupt affinity such that different queues are served by different CPUs.
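Roughly sketched in C, the idea looks like the following (a simplified illustration with made-up names, not the actual Nanos driver code): clamp the queue count, then spread the queue interrupts across the vcpus round-robin.

/* Simplified illustration -- not the actual Nanos driver code. */
static int pick_queue_count(int device_max_queues, int requested_queues, int total_cpus)
{
    int queues = device_max_queues;     /* what the device advertises */
    if (requested_queues > 0 && requested_queues < queues)
        queues = requested_queues;      /* "io-queues" manifest override */
    if (queues > total_cpus)
        queues = total_cpus;            /* never more queues than vcpus */
    return queues;
}

/* Each queue q then gets its interrupt affinity set to cpu (q % total_cpus). */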
Locally, on the host you can see how many queues you have available by using ethtool like so:
eyberg@box:~$ ethtool -l eno2
Channel parameters for eno2:
Pre-set maximums:
RX: 8
TX: 8
Other: n/a
Combined: n/a
Current hardware settings:
RX: 7
TX: 4
Other: n/a
Combined: n/a
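If the current settings are below the pre-set maximums, you can raise them on the host with ethtool -L (capital L), for example:

sudo ethtool -L eno2 rx 8 tx 8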
You can look at /proc/interrupts to see how they are being distributed across cpu threads (YMMV here; the column to print depends on your thread count):
eyberg@box:~$ cat /proc/interrupts | grep eno2 | awk '{print $28}'
eno2-0
eno2-1
eno2-2
eno2-3
eno2-4
eno2-5
eno2-6
Note: I purposely shortened the output here but the columns in between are individual threads.
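You can also check which cpus a given queue's interrupt is allowed to land on via /proc/irq (eno2-2 here is just an example queue name):

irq=$(grep eno2-2 /proc/interrupts | awk '{print $1}' | tr -d :)
cat /proc/irq/$irq/smp_affinity_list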
You can then even watch traffic on each queue - this is useful to verify that you are indeed using what you think you are using:
watch -d -n 2 "ethtool -S eno2 | grep rx | grep packets | column"
Cloud Specific MultiQueue Settings
Now, outside of benchmarking, most of you probably don't have a strong reason to set all of this locally. So what happens when you deploy to the cloud?
The number of queues assigned on Google Cloud depends on the network interface type you are using. We support both virtio-net and gvNIC. If you are using virtio-net the equation is:
vcpu/number-of-nics
If you are using gvNIC (Google's in-house network adapter, which we also support and which is used by the Arm instances) the default count is:
2(vcpu)/number-of-nics
Furthermore, virtio can have up to 32 queues, whereas gvNIC can only have up to 16.
On AWS, if you're using ENA, it is 1 queue per vcpu, again up to a max of 32.
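To make the math concrete, a few examples (one nic in each case):

8 vcpus, virtio-net:   8 / 1 = 8 queues
8 vcpus, gvNIC:        2 * 8 / 1 = 16 queues assigned (the driver will still only use as many as there are vcpus)
64 vcpus, virtio-net:  64 / 1 = 64, capped at 32 queues
8 vcpus, ENA (AWS):    1 per vcpu = 8 queues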
So go get yourself some high vcpu instances, set your iperf cannons to stun and enjoy the new multi-queue support.