I’m writing a new vector search SQLite Extension

https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html

2024-05-02 by Alex Garcia

tl;dr — sqlite-vec will be a new SQLite extension for vector search, replacing sqlite-vss. It will be an embeddable "fast enough" vector search tool, that can run anywhere SQLite runs - including WASM! It's still in active development, but check out the repo to learn when it will be ready!


I'm working on a new SQLite extension! It's called sqlite-vec, an extension for vector search, written purely in C. It's meant to replace sqlite-vss, another vector search SQLite extension I released in February 2023, which has a number of problems. I believe the approach I'm taking with sqlite-vec solves a number of problem it's predecessor has, will have a much nicer and performant SQL API, and is a better fit for all applications who want an embed vector search solution!

What sqlite-vec will be

sqlite-vec will be a SQLite extension written purely in C with no dependencies. It will provide custom SQL functions and virtual tables for fast vector search, as well as other tools and utilities for working with vectors (quantization, JSON/BLOB/numpy conversions, vector arithmetic, etc.).

A quick sample of what vector search will look like with sqlite-vec, in pure SQL:

.load ./vec0

-- a "vector store" for 8-dimensional floating point numbers
create virtual table vec_examples using vec0(
  sample_embedding float[8]
);

-- vectors can be provided as JSON or in a compact binary format
insert into vec_examples(rowid, sample_embedding)
  values
    (1, '[-0.200, 0.250, 0.341, -0.211, 0.645, 0.935, -0.316, -0.924]'),
    (2, X'E5D0E23E894100BF8FC2B53E426045BFF4FD343F7D3F35BFA4703DBE1058B93E'),
    (3, '[0.716, -0.927, 0.134, 0.052, -0.669, 0.793, -0.634, -0.162]'),
    (4, X'8FC235BFC3F5A83E9EEF273F9EEF273DA4707DBF23DB393FB81EC53E7D3F75BF');


-- KNN style query goes brrrr
  select
    rowid,
    distance
  from vec_examples
  where sample_embedding match '[0.890, 0.544, 0.825, 0.961, 0.358, 0.0196, 0.521, 0.175]'
  order by distance
  limit 2;

/*
rowid,distance
2,2.38687372207642
1,2.38978505134583
*/

Using sqlite-vec means using pure SQL, just CREATE VIRTUAL TABLE, INSERT INTO, and SELECT statements.

This work is exciting - for many reasons! First off, "written in pure C" means it will be able to run anywhere. The previous sqlite-vss extension, which had some cumbersome C++ dependencies, was only able to reliably run on Linux and MacOS machines, with binaries in the 3MB-5MB range. By contrast, sqlite-vec will run on all platforms (Linux/MacOS/Windows), in the browser with WebAssembly, and even smaller devices like mobile phones and Raspberry Pis! Smaller binaries too, in the few 100's of KB range.

Additionally, sqlite-vec has more control over memory usage. By default, vectors are stored in 'chunks' in shadow tables, and are read chunk-by-chunk during KNN searches. This means you don't need to store everything in RAM! Though if you do want in-memory speed, you could use the PRAGMA mmap_size command to make KNN searches much faster.

And finally, sqlite-vec is built in a new "era" of vector search tooling and research. There will be better support for "adative-length embeddings" (aka Matryoshka embeddings), and int8/bit vector support for binary and scalar quantization. This means more control over the speed, accuracy, and disk space that your vectors take up.

Though initially, sqlite-vec will only support exhaustive full-scan vector search. There will be no "approximate nearest neighbors" (ANN) options. But I hope to add IVF + HNSW in the future!

Demo

Enough yappin' let's see a demo

sqlite-vec is running right now in your browser! If you open up devtools, you'll see an (un-optimized) 5.9MB sqlite3.wasm file, which is the official SQLite WASM build with sqlite-vec compiled in. There is a movies.bit.db SQLite database also loaded, which is a 2.6MB SQLite database, which has this movies dataset with 4,800 movie overviews in the articles table. The separate vec_movies virtual table is a vector index of those "overviews" embedded, with Nomic's 1.5 embeddings model, quantized to binary vectors.

Here's a quick sample of what the data looks like:

Here we see the articles table has columns like title, release_date, and overview. The overview column here is important - it's a very short sentence describing the plot of the movie. We also have the vec_articles virtual table, which stores embeddings of the articles.overview as the overview_embeddings column. Thy are binary vectors with 768 dimensions, which takes up 96 bytes (768 / 8 = 96).

Now let's see how a KNN-style search works! Here's a lil' table select component. Select a movie with the radio button on the left-hand side.

The movie ID you selected will now pre-populate the :selected_movie SQL parameter in this KNN SQL query!

Those are the 10 closest movies to the one you selected! The "closest" one (using hamming distance, because it's a binary vector) will always be the same movie, with a distance of 0. Keep in mind, embedding a single-sentence plot description of a small movie dataset doesn't give the best results (and binary quantization sacrifices even more quality), but the core idea remains. Fast, "good enough" vector search, in your browser!

More docs about this KNN-style query will come soon, but in case you wanted to poke around at the internals, try adding a EXPLAIN QUERY PLAN to the beginning of the SELECT statement. You'll see the 0:knn "index" that vec_movies uses.

But what's wrong with sqlite-vss?

I won't go into all the details, but there were a number of roadblocks in the development and adoption of sqlite-vss, including:

  • Only worked on Linux + MacOS machines (no Windows, WASM, mobile devices, etc.)
  • Stored vectors all in-memory
  • Various transaction-related bugs and issues
  • Extremely hard and time-consuming to compile
  • Missing common vector operations (scalar/binary quantization)

Nearly all of these are because sqlite-vss depended on Faiss. With a lot of time and energy, some of these issues could maybe be solved, but many of them would be blocked by Faiss.

Given all this, a no-dependency and low-level solution seemed really enticing. Turns out, vector search isn't too complicated, so sqlite-vec was born!

Still not ready, but soon!

The core features of sqlite-vec work, but I have very little error handling + testing. I have 246 TODOs in the sqlite-vec.c file, which I'm tracking with a lil script:

$ make progress
deno run --allow-read=sqlite-vec.c scripts/progress.ts
Number of todo_assert()'s:      191
Number of "// TODO" comments:   41
Number of todo panics:          14
Total TODOs:                    246

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (0/246)

0% complete to sqlite-vec v0

Once those 246 TODOs are completed, then the first v0.1.0 of sqlite-vec will be released, with documentation, demos, bindings, and more! I'm aiming for a month or so, but we shall see!

Is your company interested in the success of sqlite-vec? I'd love to chat! Email me for more information.

{
"by": "sebg",
"descendants": 85,
"id": 40243168,
"kids": [
40244002,
40243906,
40246236,
40247186,
40250624,
40243851,
40245186,
40243740,
40244568,
40244410,
40249711,
40245085,
40245171,
40247018,
40244182,
40251617,
40245364,
40254337,
40300980,
40251295,
40245074,
40245576,
40245027
],
"score": 526,
"time": 1714700670,
"title": "I’m writing a new vector search SQLite Extension",
"type": "story",
"url": "https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html"
}
{
"author": null,
"date": null,
"description": "sqlite-vec is an new vector search SQLite extension, coming soon!",
"image": "https://alexgarcia.xyz/blog/1.png",
"logo": "https://logo.clearbit.com/alexgarcia.xyz",
"publisher": "Alex Garcia",
"title": "I’m writing a new vector search SQLite Extension",
"url": "https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html"
}
{
"url": "https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html",
"title": "I'm writing a new vector search SQLite Extension",
"description": "2024-05-02 by Alex Garcia tl;dr — sqlite-vec will be a new SQLite extension for vector search, replacing sqlite-vss. It will be an embeddable \"fast enough\" vector search tool, that can run anywhere...",
"links": [
"https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html"
],
"image": "https://alexgarcia.xyz/blog/1.png",
"content": "<div><p>2024-05-02 by <a target=\"_blank\" href=\"https://alexgarcia.xyz/\">Alex Garcia</a></p><section>\n<div>\n<blockquote>\n<p><em>tl;dr — <a target=\"_blank\" href=\"https://github.com/asg017/sqlite-vec\"><code>sqlite-vec</code></a> will be a new\nSQLite extension for vector search, replacing <code>sqlite-vss</code>. It will be an\nembeddable \"fast enough\" vector search tool, that can run anywhere SQLite\nruns - including WASM! It's still in active development, but\n<a target=\"_blank\" href=\"https://github.com/asg017/sqlite-vec\">check out the repo</a> to learn when it\nwill be ready!</em></p>\n</blockquote>\n<hr />\n</div>\n<p>I'm working on a new SQLite extension! It's called\n<a target=\"_blank\" href=\"https://github.com/asg017/sqlite-vec\"><code>sqlite-vec</code></a>, an extension for vector\nsearch, written purely in C. It's meant to replace\n<a target=\"_blank\" href=\"https://github.com/asg017/sqlite-vss\"><code>sqlite-vss</code></a>, another vector search\nSQLite extension I released in February 2023, which\n<a target=\"_blank\" href=\"https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html#wrong-sqlite-vss\">has a number of problems</a>. I believe the approach I'm taking\nwith <code>sqlite-vec</code> solves a number of problem it's predecessor has, will have a\nmuch nicer and performant SQL API, and is a better fit for all applications who\nwant an embed vector search solution!</p>\n<h2 id=\"what-sqlite-vec-will-be\"><a target=\"_blank\" href=\"https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html#what-sqlite-vec-will-be\">¶</a> What <code>sqlite-vec</code> will be</h2>\n<p><code>sqlite-vec</code> will be a SQLite extension written purely in C with no\ndependencies. It will provide custom SQL functions and virtual tables for fast\nvector search, as well as other tools and utilities for working with vectors\n(quantization, JSON/BLOB/numpy conversions, vector arithmetic, etc.).</p>\n<p>A quick sample of what vector search will look like with <code>sqlite-vec</code>, in pure\nSQL:</p>\n<pre><code><span><span>.</span><span>load</span><span> .</span><span>/</span><span>vec0</span></span>\n<span></span>\n<span><span>-- a \"vector store\" for 8-dimensional floating point numbers</span></span>\n<span><span>create</span><span> virtual</span><span> table</span><span> vec_examples </span><span>using</span><span> vec0(</span></span>\n<span><span> sample_embedding </span><span>float</span><span>[8]</span></span>\n<span><span>);</span></span>\n<span></span>\n<span><span>-- vectors can be provided as JSON or in a compact binary format</span></span>\n<span><span>insert into</span><span> vec_examples(rowid, sample_embedding)</span></span>\n<span><span> values</span></span>\n<span><span> (</span><span>1</span><span>, </span><span>'[-0.200, 0.250, 0.341, -0.211, 0.645, 0.935, -0.316, -0.924]'</span><span>),</span></span>\n<span><span> (</span><span>2</span><span>, X</span><span>'E5D0E23E894100BF8FC2B53E426045BFF4FD343F7D3F35BFA4703DBE1058B93E'</span><span>),</span></span>\n<span><span> (</span><span>3</span><span>, </span><span>'[0.716, -0.927, 0.134, 0.052, -0.669, 0.793, -0.634, -0.162]'</span><span>),</span></span>\n<span><span> (</span><span>4</span><span>, X</span><span>'8FC235BFC3F5A83E9EEF273F9EEF273DA4707DBF23DB393FB81EC53E7D3F75BF'</span><span>);</span></span>\n<span></span>\n<span></span>\n<span><span>-- KNN style query goes brrrr</span></span>\n<span><span> select</span></span>\n<span><span> rowid,</span></span>\n<span><span> distance</span></span>\n<span><span> from</span><span> vec_examples</span></span>\n<span><span> where</span><span> sample_embedding match </span><span>'[0.890, 0.544, 0.825, 0.961, 0.358, 0.0196, 0.521, 0.175]'</span></span>\n<span><span> order by</span><span> distance</span></span>\n<span><span> limit</span><span> 2</span><span>;</span></span>\n<span></span>\n<span><span>/*</span></span>\n<span><span>rowid,distance</span></span>\n<span><span>2,2.38687372207642</span></span>\n<span><span>1,2.38978505134583</span></span>\n<span><span>*/</span></span></code></pre>\n<p>Using <code>sqlite-vec</code> means using pure SQL, just <code>CREATE VIRTUAL TABLE</code>,\n<code>INSERT INTO</code>, and <code>SELECT</code> statements.</p>\n<p>This work is exciting - for many reasons! First off, \"written in pure C\" means\nit will be able to run anywhere. The previous <code>sqlite-vss</code> extension, which had\nsome cumbersome C++ dependencies, was only able to reliably run on Linux and\nMacOS machines, with binaries in the <code>3MB</code>-<code>5MB</code> range. By contrast,\n<code>sqlite-vec</code> will run on all platforms (Linux/MacOS/Windows), in the browser\nwith WebAssembly, and even smaller devices like mobile phones and Raspberry Pis!\nSmaller binaries too, in the few 100's of KB range.</p>\n<p>Additionally, <code>sqlite-vec</code> has more control over memory usage. By default,\nvectors are stored in 'chunks' in shadow tables, and are read chunk-by-chunk\nduring KNN searches. This means you don't need to store everything in RAM!\nThough if you do want in-memory speed, you could use the\n<a target=\"_blank\" href=\"https://www.sqlite.org/pragma.html#pragma_mmap_size\"><code>PRAGMA mmap_size</code></a>\ncommand to make KNN searches much faster.</p>\n<p>And finally, <code>sqlite-vec</code> is built in a new \"era\" of vector search tooling and\nresearch. There will be better support for \"adative-length embeddings\" (aka\n<a target=\"_blank\" href=\"https://huggingface.co/blog/matryoshka\">Matryoshka embeddings</a>), and\n<code>int8</code>/<code>bit</code> vector support for\n<a target=\"_blank\" href=\"https://huggingface.co/blog/embedding-quantization\">binary and scalar quantization</a>.\nThis means more control over the speed, accuracy, and disk space that your\nvectors take up.</p>\n<p>Though initially, <code>sqlite-vec</code> will only support exhaustive full-scan vector\nsearch. There will be no \"approximate nearest neighbors\" (ANN) options. But I\nhope to add IVF + HNSW in the future!</p>\n<h2 id=\"demo\"><a target=\"_blank\" href=\"https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html#demo\">¶</a> Demo</h2>\n<p>Enough yappin' let's see a demo</p>\n<p><code>sqlite-vec</code> is running right now in your browser! If you open up devtools,\nyou'll see an (un-optimized) <code>5.9MB</code> <code>sqlite3.wasm</code> file, which is the\n<a target=\"_blank\" href=\"https://sqlite.org/wasm/doc/trunk/about.md\">official SQLite WASM build</a> with\n<code>sqlite-vec</code> compiled in. There is a <code>movies.bit.db</code> SQLite database also\nloaded, which is a <code>2.6MB</code> SQLite database, which has\n<a target=\"_blank\" href=\"https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv\">this movies dataset</a>\nwith 4,800 movie overviews in the <code>articles</code> table. The separate <code>vec_movies</code>\nvirtual table is a vector index of those \"overviews\" embedded, with\n<a target=\"_blank\" href=\"https://blog.nomic.ai/posts/nomic-embed-matryoshka\">Nomic's 1.5 embeddings model</a>,\nquantized to binary vectors.</p>\n<p>Here's a quick sample of what the data looks like:</p>\n<p>Here we see the <code>articles</code> table has columns like <code>title</code>, <code>release_date</code>, and\n<code>overview</code>. The <code>overview</code> column here is important - it's a very short sentence\ndescribing the plot of the movie. We also have the <code>vec_articles</code> virtual table,\nwhich stores embeddings of the <code>articles.overview</code> as the <code>overview_embeddings</code>\ncolumn. Thy are binary vectors with 768 dimensions, which takes up 96 bytes\n(<code>768 / 8 = 96</code>).</p>\n<p>Now let's see how a KNN-style search works! Here's a lil' table select\ncomponent. Select a movie with the radio button on the left-hand side.</p>\n<p>The movie ID you selected will now pre-populate the <code>:selected_movie</code> SQL\nparameter in this KNN SQL query!</p>\n<p>Those are the 10 closest movies to the one you selected! The \"closest\" one\n(using hamming distance, because it's a binary vector) will always be the same\nmovie, with a distance of 0. Keep in mind, embedding a single-sentence plot\ndescription of a small movie dataset doesn't give the best results (and binary\nquantization sacrifices even more quality), but the core idea remains. Fast,\n\"good enough\" vector search, in your browser!</p>\n<p>More docs about this KNN-style query will come soon, but in case you wanted to\npoke around at the internals, try adding a <code>EXPLAIN QUERY PLAN</code> to the beginning\nof the <code>SELECT</code> statement. You'll see the <code>0:knn</code> \"index\" that <code>vec_movies</code>\nuses.</p>\n<h2 id=\"wrong-sqlite-vss\"> But what's wrong with <code>sqlite-vss</code>?</h2>\n<p>I won't go into all the details, but there were a number of roadblocks in the\ndevelopment and adoption of <code>sqlite-vss</code>, including:</p>\n<ul>\n<li>Only worked on Linux + MacOS machines (no Windows, WASM, mobile devices, etc.)</li>\n<li>Stored vectors all in-memory</li>\n<li>Various transaction-related bugs and issues</li>\n<li>Extremely hard and time-consuming to compile</li>\n<li>Missing common vector operations (scalar/binary quantization)</li>\n</ul>\n<p>Nearly all of these are because <code>sqlite-vss</code> depended on\n<a target=\"_blank\" href=\"https://github.com/facebookresearch/faiss\">Faiss</a>. With a lot of time and\nenergy, some of these issues could <em>maybe</em> be solved, but many of them would be\nblocked by Faiss.</p>\n<p>Given all this, a no-dependency and low-level solution seemed really enticing.\nTurns out, vector search isn't too complicated, so <code>sqlite-vec</code> was born!</p>\n<h2 id=\"still-not-ready-but-soon\"><a target=\"_blank\" href=\"https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html#still-not-ready-but-soon\">¶</a> Still not ready, but soon!</h2>\n<p>The core features of <code>sqlite-vec</code> work, but I have very little error handling +\ntesting. I have 246 TODOs in the <code>sqlite-vec.c</code> file, which I'm tracking with a\nlil script:</p>\n<pre><code><span><span>$ make progress</span></span>\n<span><span>deno run --allow-read=sqlite-vec.c scripts/progress.ts</span></span>\n<span><span>Number of todo_assert()'s: 191</span></span>\n<span><span>Number of \"// TODO\" comments: 41</span></span>\n<span><span>Number of todo panics: 14</span></span>\n<span><span>Total TODOs: 246</span></span>\n<span><span></span></span>\n<span><span>░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (0/246)</span></span>\n<span><span></span></span>\n<span><span>0% complete to sqlite-vec v0</span></span></code></pre>\n<p>Once those 246 TODOs are completed, then the first <code>v0.1.0</code> of <code>sqlite-vec</code> will\nbe released, with documentation, demos, bindings, and more! I'm aiming for a\nmonth or so, but we shall see!</p>\n<p>Is your company interested in the success of <code>sqlite-vec</code>? I'd love to chat!\n<a target=\"_blank\" href=\"https://alexgarcia.xyz/\">Email me</a> for more information.</p>\n</section></div>",
"author": "@agarcia_me",
"favicon": "https://alexgarcia.xyz/vite.svg",
"source": "alexgarcia.xyz",
"published": "2024-05-02T00:00:00.000Z",
"ttr": 215,
"type": "article"
}