Four ways to improve the retrieval of a RAG system

Jan 7, 2024

Last time, I expressed my optimism about Retrieval-Augmented Generation (RAG) as the core mechanism for personalizing large language model (LLM) applications. What if we want to do better?

In this post, I will summarize four techniques for improving the retrieval quality of a RAG system: the sub-question query engine (from LlamaIndex), RAG-Fusion, RAG-end2end, and the famous LoRA trick.

Asking the LLM to break down the question

A RAG system employs an LLM to generate natural-language answers to users’ queries. We don’t have to stop there, though: we can take further advantage of the LLM by asking it to break down the user’s query into sub-questions first. The document retriever can then look up each of the smaller questions, giving the answer-generating LLM richer context to play with.
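Here is a minimal sketch of that step. It assumes a hypothetical `llm` callable that maps a prompt string to a completion string and a hypothetical `retrieve` callable that maps a query to a list of documents; neither is a real library API, and the prompt wording is made up for illustration.

```python
import re
from typing import Callable

# Hypothetical prompt; the wording is an assumption, not taken from any library.
DECOMPOSE_PROMPT = (
    "Break the question below into short, self-contained sub-questions, "
    "one per line.\n\nQuestion: {question}\nSub-questions:"
)

# Matches a leading bullet ("-", "*", "•") or list number ("1.", "2)").
_LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def decompose(question: str, llm: Callable[[str], str]) -> list[str]:
    """Ask the LLM for sub-questions and parse one per non-empty line."""
    completion = llm(DECOMPOSE_PROMPT.format(question=question))
    sub_questions = []
    for line in completion.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        if cleaned:
            sub_questions.append(cleaned)
    return sub_questions


def retrieve_per_sub_question(
    question: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Return one recall set (a list of documents) per sub-question."""
    return {sq: retrieve(sq) for sq in decompose(question, llm)}
```

In practice, `llm` would wrap whichever model you already use for answer generation, and `retrieve` would wrap your existing vector or keyword search.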

An example of how the LLM broke down a question into 3 specific sub-questions.

The fun part is how we aggregate those recall sets.

Naively, we can just concatenate the retrieved sets of documents and feed them into the LLM as context when generating the answer, ignoring the order in which the retrieved documents are presented. But we can do much better.
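For reference, the naive baseline fits in a few lines. This is only a sketch: the function names are hypothetical, and dropping repeated documents is an optional extra that goes slightly beyond plain concatenation.

```python
def concatenate_recall_sets(recall_sets: list[list[str]]) -> list[str]:
    """Merge recall sets in the order given, keeping each document once."""
    seen: set[str] = set()
    merged: list[str] = []
    for recall_set in recall_sets:
        for doc in recall_set:
            if doc not in seen:  # assumption: skip documents already included
                seen.add(doc)
                merged.append(doc)
    return merged


def build_context(recall_sets: list[list[str]], separator: str = "\n\n") -> str:
    """Join the merged documents into a single context string for the prompt."""
    return separator.join(concatenate_recall_sets(recall_sets))
```

Nothing here looks at how relevant a document is, so the final order simply reflects the order in which the sub-questions were asked.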

In LlamaIndex, the sub-question query engine invokes the LLM extensively. For each retrieved set of documents, it generates an answer to the corresponding sub-question. Then, the LLM is asked to come up with a final answer based on those sub-answers, not the retrieved documents themselves.
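The flow can be sketched without LlamaIndex itself. The code below is not LlamaIndex's implementation or API; `llm` is a hypothetical prompt-to-completion callable, `recall_sets` is assumed to map each sub-question to its retrieved documents, and the prompt templates are invented for illustration.

```python
from typing import Callable


def answer_via_sub_questions(
    question: str,
    recall_sets: dict[str, list[str]],
    llm: Callable[[str], str],
) -> str:
    """Answer each sub-question from its own documents, then synthesize."""
    sub_answers: list[tuple[str, str]] = []
    for sub_question, docs in recall_sets.items():
        context = "\n\n".join(docs)
        prompt = f"Context:\n{context}\n\nQuestion: {sub_question}\nAnswer:"
        sub_answers.append((sub_question, llm(prompt)))

    # The final prompt contains only the sub-question/sub-answer pairs,
    # not the retrieved documents themselves.
    qa_pairs = "\n".join(f"Q: {sq}\nA: {sa}" for sq, sa in sub_answers)
    final_prompt = (
        f"Sub-questions and their answers:\n{qa_pairs}\n\n"
        f"Original question: {question}\nFinal answer:"
    )
    return llm(final_prompt)
```

With n sub-questions this makes n + 1 LLM calls, which is the price paid for shielding the final prompt from the raw documents.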

An instance of LlamaIndex’s Sub-Question Query Engine (impersonating the user) prompting the LLM to generate a final answer based on three sub-answers.

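RAG-Fusion, on the other hand, still feeds documents to the LLM as context. But first, it re-ranks the documents with Reciprocal Rank Fusion (RRF): each document earns a score of 1/(k + rank) from every recall set it appears in, where k is a smoothing constant, and the scores are summed. Documents that rank highly in many recall sets therefore rise to the top. Implicitly, RRF assumes: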

  • Documents that are relevant to more sub-questions are more helpful in answering the original query. (A counter-example is a generic article that is relevant to every question, but not specific enough to add any value to the final answer.)
  • The LLM gives more weight to the top results, instead of treating the list as an unordered set.
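A minimal sketch of RRF as it is usually defined: every recall set contributes 1/(k + rank) to each document it contains, and the contributions are summed. The function name is hypothetical, k = 60 is the commonly used default rather than anything RAG-Fusion requires, and each recall set is assumed to list a document at most once.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks start at 1, so the top hit of a list earns 1 / (k + 1).
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    recall_sets = [["a", "b", "c"], ["b", "a", "d"], ["b", "e"]]
    for doc, score in reciprocal_rank_fusion(recall_sets):
        print(f"{doc}: {score:.4f}")
```

In this toy input, "b" comes out on top because it appears near the head of all three lists, while documents seen in a single list fall to the bottom.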

RRF also lets one combine results from different search methods, a paradigm often known as “hybrid search”.

  • Azure AI Search uses it to aggregate recall sets from traditional keyword (full-text) searches and from embedding-based vector searches.
  • A more approachable example is Obsidian-Copilot, a plugin to the note-taking app Obsidian that combines BM25-based searches (via OpenSearch) with semantic searches.
  • If you feel like implementing one yourself, this tutorial from the vector database provider Pinecone is a good start.

Note that the search methods fed into RRF should differ only in how they arrive at their recall sets, not in which documents they can choose from. A document must have a chance to appear in several recall sets in the first place before it can be promoted for ranking well in more than one of them.

Re-introducing the training process

In Why RAG is big, I identified a major advantage of RAG as “no need of training any models”: you can use off-the-shelf embedding models and LLMs to put together a RAG system with reasonable accuracy.

But if you want to train, you can. This may especially help with domain-specific applications, as RAG-end2end has demonstrated. In fact, the paper presenting the original Dense Passage Retrieval method (DPR; the “R” in “RAG”) also had to spice the encoders (BERTs) up with some fine-tuning before it outperformed BM25 by 25%.

I assume that many are interested in RAG due to budget constraints, so training a neural model, or even merely fine-tuning one, is not feasible. In that case, you may want to check out LoRA, or “Low-Rank Adaptation of Large Language Models”.

LoRA isn’t for LLMs only; it also works for Stable Diffusion (SD), a family of generative art models. LoRAs adapt SD models to different styles, as shown on the website Civitai. I think this is an intuitive illustration of what LoRA can do.

Simply put, LoRA is a “hack” where you freeze the original model, add small trainable low-rank matrices next to its weights, and train only those little “patches”. Thus, LoRA adapts an LLM to specific domains of knowledge without updating all of its parameters (think: billions). Also, since you control the size (the rank) of the “patches” you inject, LoRA can adapt to your particular time and hardware budget. Takeaway: it is cheap and versatile.
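To make the “patch” idea concrete, here is a dependency-free sketch of a single LoRA-adapted linear layer. It follows the usual formulation, output = W x + (alpha / r) * B (A x), where W stays frozen and only the thin matrices A (r × d_in) and B (d_out × r) are trained. The function names and the 4096-wide layer are illustrative assumptions; a real setup would use a deep learning framework rather than Python lists.

```python
Matrix = list[list[float]]
Vector = list[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float = 1.0) -> Vector:
    """Frozen base output plus the scaled low-rank update."""
    rank = len(A)                      # A has one row per rank dimension
    base = matvec(W, x)                # W is frozen, never updated
    update = matvec(B, matvec(A, x))   # only A and B would be trained
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Trainable parameters: full fine-tuning vs. a LoRA patch of given rank."""
    return d_out * d_in, rank * (d_in + d_out)


if __name__ == "__main__":
    full, patch = lora_param_counts(4096, 4096, rank=8)
    print(f"full: {full:,}  lora: {patch:,}  ratio: {full // patch}x")
```

Because B conventionally starts at zero, the adapted layer initially behaves exactly like the original one, and raising or lowering the rank is the knob that trades adaptation capacity for training cost.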


In this short post, I reviewed four techniques for improving the relevancy of a RAG pipeline. Two of them rely on breaking down the original query and exploiting the generative capability of LLMs, while the other two strive to further enhance the models themselves with domain-specific knowledge.

As this is a vibrant area of research today, I’m sure there are many I’ve missed, as well as many more new tricks to come.