Show HN: SpRAG – Open-source RAG implementation for challenging real-world tasks

https://github.com/SuperpoweredAI/spRAG

dsRAG


Note: If you’re using (or planning to use) dsRAG in production, please fill out this short form telling us about your use case. This helps us prioritize new features. In return I’ll give you my personal email address, which you can use for priority email support.

What is dsRAG?

dsRAG is a retrieval engine for unstructured data. It is especially good at handling challenging queries over dense text, like financial reports, legal documents, and academic papers. dsRAG achieves substantially higher accuracy than vanilla RAG baselines on complex open-book question answering tasks. On one especially challenging benchmark, FinanceBench, dsRAG gets accurate answers 83% of the time, compared to the vanilla RAG baseline which only gets 19% of questions correct.

There are three key methods used to improve performance over vanilla RAG systems:

  1. Semantic sectioning
  2. AutoContext
  3. Relevant Segment Extraction (RSE)

Semantic sectioning

Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting and ending lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM is also prompted to generate descriptive titles for each section. These section titles get used in the contextual chunk headers created by AutoContext, which provides additional context to the ranking models (embeddings and reranker), enabling better retrieval.
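
To make the mechanics concrete, here is a minimal sketch of the line-numbering approach described above. The prompt wording, the llm_json_call helper, and the response format are illustrative assumptions, not the actual dsRAG implementation.

# Illustrative sketch of semantic sectioning (not the actual dsRAG code).
# Assumes llm_json_call is a user-supplied function that sends a prompt to an LLM
# and returns parsed JSON of the form:
#   {"sections": [{"title": str, "start_line": int, "end_line": int}, ...]}

def annotate_with_line_numbers(document: str) -> str:
    lines = document.split("\n")
    return "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))

def split_into_sections(document: str, llm_json_call) -> list[dict]:
    numbered = annotate_with_line_numbers(document)
    prompt = (
        "Identify the semantically cohesive sections in the document below. "
        "For each section, return a descriptive title plus its starting and "
        "ending line numbers (inclusive).\n\n" + numbered
    )
    response = llm_json_call(prompt)
    lines = document.split("\n")
    return [
        {
            "title": section["title"],
            "text": "\n".join(lines[section["start_line"]:section["end_line"] + 1]),
        }
        for section in response["sections"]
    ]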

AutoContext (contextual chunk headers)

AutoContext creates contextual chunk headers that contain document-level and section-level context, and prepends those chunk headers to the chunks prior to embedding them. This gives the embeddings a much more accurate and complete representation of the content and meaning of the text. In our testing, this feature leads to a dramatic improvement in retrieval quality. In addition to increasing the rate at which the correct information is retrieved, AutoContext also substantially reduces the rate at which irrelevant results show up in the search results. This reduces the rate at which the LLM misinterprets a piece of text in downstream chat and generation applications.
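
At the chunk level, the idea reduces to building a short header from the document title, document summary, and section title, and prepending it to the chunk text before it goes to the embedding model. A minimal sketch (the header format and field names are illustrative, not dsRAG's exact format):

# Illustrative sketch of contextual chunk headers (header format is an assumption).

def build_chunk_header(document_title: str, document_summary: str, section_title: str) -> str:
    return (
        f"Document: {document_title}\n"
        f"Document summary: {document_summary}\n"
        f"Section: {section_title}"
    )

def embed_chunks_with_headers(chunks: list[dict], embed_fn) -> list:
    # embed_fn is any embedding function (e.g. a thin wrapper around an embeddings API)
    texts = [
        build_chunk_header(c["document_title"], c["document_summary"], c["section_title"])
        + "\n\n" + c["text"]
        for c in chunks
    ]
    return [embed_fn(text) for text in texts]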

Relevant Segment Extraction

Relevant Segment Extraction (RSE) is a query-time post-processing step that takes clusters of relevant chunks and intelligently combines them into longer sections of text that we call segments. These segments provide better context to the LLM than any individual chunk can. For simple factual questions, the answer is usually contained in a single chunk; but for more complex questions, the answer usually spans a longer section of text. The goal of RSE is to intelligently identify the section(s) of text that provide the most relevant information, without being constrained to fixed length chunks.

For example, suppose you have a bunch of SEC filings in a knowledge base and you ask “What were Apple’s key financial results in the most recent fiscal year?” RSE will identify the most relevant segment as the entire “Consolidated Statement of Operations” section, which will be 5-10 chunks long. Whereas if you ask “Who is Apple’s CEO?” the most relevant segment will be identified as a single chunk that mentions “Tim Cook, CEO.”
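
One way to think about RSE is as an optimization over contiguous runs of chunks: relevant chunks contribute positive value, irrelevant chunks incur a penalty, and the best segment is the contiguous run with the highest total value. The sketch below shows that idea in its simplest form (a single segment, brute-force search); the scoring details and parameter values are assumptions, not dsRAG's exact algorithm.

# Illustrative sketch of segment selection for RSE (not the actual dsRAG algorithm).
# chunk_relevance holds one relevance score per chunk, in document order.

def best_segment(chunk_relevance: list[float],
                 irrelevant_chunk_penalty: float = 0.2,
                 max_length: int = 15) -> tuple[int, int]:
    # Subtract a penalty so irrelevant chunks have negative value; otherwise the
    # highest-value "segment" would always be the entire document.
    values = [r - irrelevant_chunk_penalty for r in chunk_relevance]
    best_start, best_end, best_value = 0, 1, float("-inf")
    for start in range(len(values)):
        for end in range(start + 1, min(start + max_length, len(values)) + 1):
            value = sum(values[start:end])
            if value > best_value:
                best_start, best_end, best_value = start, end, value
    return best_start, best_end  # the segment is chunks[best_start:best_end]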

Eval results

We've evaluated dsRAG on a couple of end-to-end RAG benchmarks.

FinanceBench

First, we have FinanceBench. This benchmark uses a corpus of a few hundred 10-Ks and 10-Qs. The queries are challenging, and often require combining multiple pieces of information. Ground truth answers are provided. Answers are graded manually on a pass/fail basis. Minor rounding differences are allowed, but otherwise the answer must exactly match the ground truth answer to be considered correct.

The baseline retrieval pipeline, which uses standard chunking and top-k retrieval, achieves a score of 19% according to the paper. dsRAG, using default parameters and AutoQuery for query generation, achieves a score of 83%.

KITE

We couldn't find any other suitable end-to-end RAG benchmarks, so we decided to create our own, called KITE (Knowledge-Intensive Task Evaluation).

KITE currently consists of 4 datasets and a total of 50 questions.

  • AI Papers - ~100 academic papers about AI and RAG, downloaded from arXiv in PDF form.
  • BVP Cloud 10-Ks - 10-Ks for all companies in the Bessemer Cloud Index (~70 of them), in PDF form.
  • Sourcegraph Company Handbook - ~800 markdown files, with their original directory structure, downloaded from Sourcegraph's publicly accessible company handbook GitHub page.
  • Supreme Court Opinions - All Supreme Court opinions from Term Year 2022 (delivered from January '23 to June '23), downloaded from the official Supreme Court website in PDF form.

Ground truth answers are included with each sample. Most samples also include grading rubrics. Grading is done on a scale of 0-10 for each question, with a strong LLM doing the grading.
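
For reference, rubric-based LLM grading on a 0-10 scale generally looks something like the sketch below; this is not the exact KITE grading prompt.

# Illustrative grading prompt for LLM-based scoring (not the exact KITE prompt).

def build_grading_prompt(question: str, ground_truth: str, rubric: str, answer: str) -> str:
    return (
        "Grade the answer below on a scale of 0-10, using the rubric where provided.\n\n"
        f"Question: {question}\n"
        f"Ground truth answer: {ground_truth}\n"
        f"Grading rubric: {rubric}\n"
        f"Answer to grade: {answer}\n\n"
        "Respond with a single integer from 0 to 10."
    )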

We tested four configurations:

  • Top-k retrieval (baseline)
  • Relevant segment extraction (RSE)
  • Top-k retrieval with contextual chunk headers (CCH)
  • CCH+RSE (dsRAG default config, minus semantic sectioning)

Testing RSE and CCH on their own, in addition to testing them together, lets us see the individual contributions of those two features.

Cohere English embeddings and the Cohere 3 English reranker were used for all configurations. LLM responses were generated with GPT-4o, and grading was also done with GPT-4o.

                         Top-k   RSE    CCH+Top-k   CCH+RSE
AI Papers                4.5     7.9    4.7         7.9
BVP Cloud                2.6     4.4    6.3         7.8
Sourcegraph              5.7     6.6    5.8         9.4
Supreme Court Opinions   6.1     8.0    7.4         8.5
Average                  4.72    6.73   6.04        8.42

Using CCH and RSE together leads to a dramatic improvement in performance, from 4.72 -> 8.42. Looking at the RSE and CCH+Top-k results, we can see that using each of those features individually leads to a large improvement over the baseline, with RSE appearing to be slightly more important than CCH.

To put these results in perspective, we also tested the CCH+RSE configuration with a smaller model, GPT-4o Mini. As expected, this led to a decrease in performance compared to using GPT-4o, but the difference was surprisingly small (7.95 vs. 8.42). Using CCH+RSE with GPT-4o Mini dramatically outperforms the baseline RAG pipeline even though the baseline uses a 17x more expensive LLM. This suggests that the LLM plays a much smaller role in end-to-end RAG system accuracy than the retrieval pipeline does.

                         CCH+RSE (GPT-4o)   CCH+RSE (GPT-4o Mini)
AI Papers                7.9                7.0
BVP Cloud                7.8                7.9
Sourcegraph              9.4                8.5
Supreme Court Opinions   8.5                8.4
Average                  8.42               7.95

Note: we did not use semantic sectioning for any of the configurations tested here. We'll evaluate that one separately once we finish some of the improvements we're working on for it. We also did not use AutoQuery, as the KITE questions are all suitable for direct use as search queries.

Tutorial

Installation

To install the python package, run
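
Assuming the package is published under the same name as its import path used below (dsrag), that would be:

pip install dsrag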

Quickstart

By default, dsRAG uses OpenAI for embeddings and AutoContext, and Cohere for reranking, so to run the code below you'll need to make sure you have API keys for those providers set as environment variables with the following names: OPENAI_API_KEY and CO_API_KEY. If you want to run dsRAG with different models, take a look at the "Basic customization" section below.

You can create a new KnowledgeBase directly from a file using the create_kb_from_file helper function:

from dsrag.create_kb import create_kb_from_file
file_path = "dsRAG/tests/data/levels_of_agi.pdf"
kb_id = "levels_of_agi"
kb = create_kb_from_file(kb_id, file_path)

KnowledgeBase objects persist to disk automatically, so you don't need to explicitly save them at this point.

Now you can load the KnowledgeBase by its kb_id (only necessary if you run this from a separate script) and query it using the query method:

from dsrag.knowledge_base import KnowledgeBase
kb = KnowledgeBase("levels_of_agi")
search_queries = ["What are the levels of AGI?", "What is the highest level of AGI?"]
results = kb.query(search_queries)
for segment in results:
    print(segment)

Basic customization

Now let's look at an example of how we can customize the configuration of a KnowledgeBase. In this case, we'll customize it so that it only uses OpenAI (useful if you don't have API keys for Anthropic and Cohere). To do so, we need to pass in a subclass of LLM and a subclass of Reranker. We'll use gpt-4o-mini for the LLM (this is what gets used for document and section summarization in AutoContext) and since OpenAI doesn't offer a reranker, we'll use the NoReranker class for that.

from dsrag.llm import OpenAIChatAPI
from dsrag.reranker import NoReranker
llm = OpenAIChatAPI(model='gpt-4o-mini')
reranker = NoReranker()
kb = KnowledgeBase(kb_id="levels_of_agi", reranker=reranker, auto_context_model=llm)

Now we can add documents to this KnowledgeBase using the add_document method. Note that the add_document method takes in raw text, not files, so we'll have to extract the text from our file first. There are some utility functions for doing this in the document_parsing.py file.

from dsrag.document_parsing import extract_text_from_pdf
file_path = "dsRAG/tests/data/levels_of_agi.pdf"
text = extract_text_from_pdf(file_path)
kb.add_document(doc_id=file_path, text=text)

Architecture

KnowledgeBase object

A KnowledgeBase object takes in documents (in the form of raw text) and does chunking and embedding on them, along with a few other preprocessing operations. Then at query time you feed in queries and it returns the most relevant segments of text.

KnowledgeBase objects are persistent by default. The full configuration needed to reconstruct the object gets saved as a JSON file upon creation and updating.

Components

There are six key components that define the configuration of a KnowledgeBase, each of which is customizable:

  1. VectorDB
  2. ChunkDB
  3. Embedding
  4. Reranker
  5. LLM
  6. FileSystem

There are defaults for each of these components, as well as alternative options included in the repo. You can also define fully custom components by subclassing the base classes and passing in an instance of that subclass to the KnowledgeBase constructor.
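
As a rough illustration of the subclassing pattern, here's what a custom reranker might look like. The method name and the shape of the search results shown here are assumptions for illustration; check the Reranker base class in the repo for the actual interface.

# Illustrative custom component (method name and search-result format are assumed;
# consult the Reranker base class in dsrag.reranker for the real interface).
from dsrag.reranker import Reranker

class KeywordBoostReranker(Reranker):
    """Hypothetical reranker that moves chunks containing a keyword to the front."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    def rerank_search_results(self, query: str, search_results: list) -> list:
        # Assumed method name. Stable sort: keyword-containing results first,
        # original (vector search) order preserved otherwise.
        return sorted(
            search_results,
            key=lambda result: self.keyword.lower() not in str(result).lower(),
        )

# It would then be passed to the constructor like any other component:
# kb = KnowledgeBase(kb_id="my_kb", reranker=KeywordBoostReranker("revenue"))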

VectorDB

The VectorDB component stores the embedding vectors, as well as a small amount of metadata.

The currently available options are:

  • BasicVectorDB
  • WeaviateVectorDB
  • ChromaDB
  • QdrantVectorDB
  • MilvusDB

ChunkDB

The ChunkDB stores the content of text chunks in a nested dictionary format, keyed on doc_id and chunk_index. This is used by RSE to retrieve the full text associated with specific chunks.
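
Conceptually, the stored structure looks something like the following (the inner field names are illustrative, not the exact schema):

# Conceptual shape of the ChunkDB contents (illustrative only):
chunk_db_contents = {
    "some_doc_id": {
        0: {"chunk_text": "First chunk of the document..."},
        1: {"chunk_text": "Second chunk of the document..."},
    },
}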

The currently available options are:

  • BasicChunkDB
  • SQLiteDB

Embedding

The Embedding component defines the embedding model.

The currently available options are:

  • OpenAIEmbedding
  • CohereEmbedding
  • VoyageAIEmbedding
  • OllamaEmbedding

Reranker

The Reranker component defines the reranker. This is used after the vector database search (and before RSE) to provide a more accurate ranking of chunks.

The currently available options are:

  • CohereReranker
  • VoyageReranker

LLM

This defines the LLM to be used for document title generation, document summarization, and section summarization in AutoContext.

The currently available options are:

  • OpenAIChatAPI
  • AnthropicChatAPI
  • OllamaChatAPI

FileSystem

This defines the file system to be used for saving PDF images.

For backwards compatibility, if an existing KnowledgeBase is loaded in, a LocalFileSystem will be created by default using the storage_directory. This is also the behavior if no FileSystem is defined when creating a new KnowledgeBase.

The currently available options are:

  • LocalFileSystem
  • S3FileSystem

Usage: For the LocalFileSystem, only a base_path needs to be passed in; this defines where the files will be stored on the system. For the S3FileSystem, the following parameters are needed:

  • base_path
  • bucket_name
  • region_name
  • access_key
  • access_secret

The base_path is used when downloading files from S3. The files have to be stored locally in order to be used in the retrieval system.
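
A construction call would presumably look like the following; the parameter names come from the list above, but treat the exact constructor signatures (and the import path, which is omitted here) as assumptions.

# Illustrative FileSystem construction (signatures assumed from the parameter list
# above; see the dsRAG repo for the actual classes and import path).

local_fs = LocalFileSystem(base_path="~/dsRAG_files")

s3_fs = S3FileSystem(
    base_path="~/dsRAG_files",      # local path used when downloading files from S3
    bucket_name="my-dsrag-bucket",
    region_name="us-east-1",
    access_key="YOUR_ACCESS_KEY",
    access_secret="YOUR_ACCESS_SECRET",
)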

Config dictionaries

Since there are a lot of configuration parameters available, they're organized into a few config dictionaries. There are four config dictionaries that can be passed in to add_document (auto_context_config, file_parsing_config, semantic_sectioning_config, and chunking_config) and one that can be passed in to query (rse_params).

Default values will be used for any parameters not provided in these dictionaries, so if you just want to alter one or two parameters there's no need to send in the full dictionary.
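
For example, to override just a couple of parameters when adding a document (the keyword argument names come from the list above; the specific parameters used here are documented in the sections below, and the values are illustrative):

# Continuing from the earlier add_document example: override only the parameters
# you care about; everything else falls back to its default.
kb.add_document(
    doc_id="apple_10k_2023",
    text=text,
    semantic_sectioning_config={"use_semantic_sectioning": False},
    chunking_config={"chunk_size": 800},
)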

auto_context_config

  • use_generated_title: bool - whether to use an LLM-generated title if no title is provided (default is True)
  • document_title_guidance: str - guidance for generating the document title
  • get_document_summary: bool - whether to get a document summary (default is True)
  • document_summarization_guidance: str
  • get_section_summaries: bool - whether to get section summaries (default is False)
  • section_summarization_guidance: str

file_parsing_config

  • use_vlm: bool - whether to use VLM (vision language model) for parsing the file (default is False)
  • vlm_config: a dictionary with configuration parameters for VLM (ignored if use_vlm is False)
    • provider: the VLM provider to use - only "vertex_ai" is supported at the moment
    • model: the VLM model to use
    • project_id: the GCP project ID (required if provider is "vertex_ai")
    • location: the GCP location (required if provider is "vertex_ai")
    • save_path: the path to save intermediate files created during VLM processing
    • exclude_elements: a list of element types to exclude from the parsed text. Default is ["Header", "Footer"].

semantic_sectioning_config

  • llm_provider: the LLM provider to use for semantic sectioning - only "openai" and "anthropic" are supported at the moment
  • model: the LLM model to use for semantic sectioning
  • use_semantic_sectioning: if False, semantic sectioning will be skipped (default is True)

chunking_config

  • chunk_size: the maximum number of characters to include in each chunk
  • min_length_for_chunking: the minimum length of text to allow chunking (measured in number of characters); if the text is shorter than this, it will be added as a single chunk. If semantic sectioning is used, this parameter will be applied to each section. Setting this to a higher value than the chunk_size can help avoid unnecessary chunking of short documents or sections.

rse_params

  • max_length: maximum length of a segment, measured in number of chunks
  • overall_max_length: maximum length of all segments combined, measured in number of chunks
  • minimum_value: minimum value of a segment, measured in relevance value
  • irrelevant_chunk_penalty: float between 0 and 1
  • overall_max_length_extension: the maximum length of all segments combined will be increased by this amount for each additional query beyond the first
  • decay_rate
  • top_k_for_document_selection: the number of documents to consider
  • chunk_length_adjustment: bool, if True (default) then scale the chunk relevance values by their length before calculating segment relevance values
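
On the query side, a partial rse_params dictionary can be passed to query in the same way (parameter names from the list above; the values shown are just examples):

# Only the rse_params you want to change need to be included.
results = kb.query(
    search_queries,
    rse_params={
        "max_length": 10,
        "irrelevant_chunk_penalty": 0.2,
    },
)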

Metadata query filters

Certain vector DBs support metadata filtering when running a query (currently only ChromaDB). This allows you to have more control over what document(s) get searched. A common use case for this would be asking questions over a single document in a knowledge base, in which case you would supply the doc_id as a metadata filter.

The format of the metadata filtering is an object with the following keys:

  • field: str, # The metadata field to filter by
  • operator: str, # The operator for filtering. Must be one of: 'equals', 'not_equals', 'in', 'not_in', 'greater_than', 'less_than', 'greater_than_equals', 'less_than_equals'
  • value: str | int | float | list # If the value is a list, every item in the list must be of the same type

Example

# Filter with the "equals" operator
metadata_filter = {
    "field": "doc_id",
    "operator": "equals",
    "value": "test_id_1"
}
# Filter with the "in" operator
metadata_filter = {
    "field": "doc_id",
    "operator": "in",
    "value": ["test_id_1", "test_id_2"]
}
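
The filter is then passed in at query time; the keyword argument is assumed here to be named metadata_filter, matching the variable above.

# Restrict the search to the documents matched by the filter (argument name assumed).
results = kb.query(search_queries, metadata_filter=metadata_filter)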

Document upload flow

Documents -> semantic sectioning -> chunking -> AutoContext -> embedding -> chunk and vector database upsert

Query flow

Queries -> vector database search -> reranking -> RSE -> results

Community and support

You can join our Discord to ask questions, make suggestions, and discuss contributions.

If you’re using (or planning to use) dsRAG in production, please fill out this short form telling us about your use case. This helps us prioritize new features. In return I’ll give you my personal email address, which you can use for priority email support.

Private cloud deployment

If you want to run dsRAG in production with minimal effort, reach out to us about our commercial offering, which is a managed private cloud deployment of dsRAG.

Here are the high-level details of the offering:

Private cloud deployment (i.e. in your own AWS, Azure, or GCP account) of dsRAG.

  • Deployed as a production-ready API with endpoints for adding and deleting documents, viewing upload status, querying, etc.
  • Unlimited number of KnowledgeBases. You can just pass in the kb_id with each API call to specify which one it’s for.
  • Document upload queue with configurable concurrency limits so you don’t have to worry about rate limiting document uploads in your application code.
  • VectorDB and ChunkDB are created and managed as part of the API, so you don’t have to set those up separately.
  • Could also be deployed directly into your customers’ cloud environments if needed.

Support

  • We’ll help you customize the retrieval configuration and components for your use case and make sure everything runs smoothly and performs well.
  • Ongoing support and regular updates as needed.

If this is something you’d like to learn more about, fill out this short form and we’ll reach out ASAP.

Show HN post

Hey HN, I’m Zach from Superpowered AI (YC S22). We’ve been working in the RAG space for a little over a year now, and we’ve recently decided to open-source all of our core retrieval tech.

spRAG is a retrieval system that’s designed to handle complex real-world queries over dense text, like legal documents and financial reports. As far as we know, it produces the most accurate and reliable results of any RAG system for these kinds of tasks. For example, on FinanceBench, which is an especially challenging open-book financial question answering benchmark, spRAG gets 83% of questions correct, compared to 19% for the vanilla RAG baseline (which uses Chroma + OpenAI Ada embeddings + LangChain).

You can find more info about how it works and how to use it in the project’s README. We’re also very open to contributions. We especially need contributions around integrations (i.e. adding support for more vector DBs, embedding models, etc.) and around evaluation.