Build a RAG system with Jina Embeddings and Qdrant

Create a RAG system with Jina Embeddings v2, Qdrant vector database, LlamaIndex, and Mistral LLM.


Retrieval-augmented generation (RAG) is a rapidly growing family of techniques that was originally conceived as an extension to large language models (LLMs) but is increasingly seen as an alternative paradigm that re-imagines the role of AI models entirely.

We’ll show you how to make a basic RAG engine using the LlamaIndex framework, an open-source Mistral LLM, Jina Embeddings v2, and Qdrant’s AI-ready vector database.

RAG in Context

In a RAG system, the role of the AI model is still central but much more limited. Instead of answering prompts from its internal knowledge, derived from its training data, the model is presented with the relevant information in its prompt. Its role is then to understand and synthesize that information into an answer.

This plays to the strengths of AI models and minimizes their weaknesses.

For example, we might give the Mistral-Instruct model the following prompt:

Answer the query. Please be brief, concise, and complete.
Query: Why is Taylor Swift planning to attend the Superbowl?
Answer:

When we did that, the model responded with:

I don't have real-time information. However, as of my last update, 
Taylor Swift is not officially confirmed to attend or perform at 
the Superbowl. If there are any updates, I recommend checking the 
latest news sources.

Large language models are expensive to update and retrain, so their internal information can never be entirely up to date.

Furthermore, large language models are well known to “hallucinate,” i.e., to confabulate plausible answers that may be partly or wholly incorrect, so their output always has to be verified. For example:

Answer the query. Please be brief, concise, and complete.
Query: What was Weird Al Yankovic's first Taylor Swift parody?
Answer:

Weird Al Yankovic's first Taylor Swift parody was "TMZ," a spoof 
of Swift's song "Blank Space." The parody was released in 2014 
as part of Yankovic's album "Mandatory Fun." The song humorously 
addresses the obsession with celebrity gossip and scandals, as 
reported by the tabloid news website TMZ.

In fact, the model got this wrong: Weird Al Yankovic’s “TMZ” was released in 2011 on the album Alpocalypse, and it parodies Swift’s “You Belong with Me,” not “Blank Space.”

The nature of neural networks makes hallucination difficult, if not impossible, to prevent entirely, and this strongly undermines any plan to use AI models as information sources.

In the words of Sam Altman, CEO of OpenAI:

“I think people make a big mistake when they think of these models [LLMs] as a database. […] It will work, sort of. But it’s, like, the world’s most expensive, slowest, least accurate database.”

RAG minimizes reliance on an AI model’s internal knowledge of facts about the world and takes advantage of its biggest strengths: natural language understanding and fluent language generation.

Let’s take another example of a question Mistral-Instruct can’t answer:

Answer the query. Please be brief, concise, and complete.

Query: What is the title of Taylor Swift's next album?
Answer:

As of the time I am providing this answer, Taylor Swift has 
not yet announced the title of her next album. She has released 
several re-recorded albums, but the title of her next original 
album remains unknown. I recommend checking her official social 
media channels or website for the latest updates.

Now, a RAG system might have a database of recent news articles, or perhaps search the Internet for texts related to the title of Taylor Swift’s next album. Then, it would insert the texts it finds into the prompt.
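To make the mechanics concrete, here is a minimal sketch of that retrieve-then-prompt step. The search() function is a hypothetical stand-in for whatever retrieval backend the system uses, returning the most relevant text snippets for a query:

def search(query: str, top_k: int = 2) -> list[str]:
    # Hypothetical stand-in: a real system would query a vector
    # database or a web search API here.
    return ["(retrieved news article text goes here)"]

def build_rag_prompt(query: str) -> str:
    # Insert the retrieved texts into the prompt as context.
    context = "\n\n".join(search(query))
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query.\n"
        f"Query: {query}\n"
        "Answer: "
    )

The prompt produced by build_rag_prompt() then goes to the LLM in place of the bare question.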

Let’s assume it identifies this AP article as relevant to the query and constructs the following prompt:

Context information is below.
---------------------
February 6, 2024
LOS ANGELES (AP) — Forget her “Reputation.” Taylor Swift has a 
whole new album coming out.

Accepting the Grammy for best pop vocal album on Sunday night, 
Taylor said she’d been keeping a secret for two years.

“My brand-new album comes out April 19. It’s called ‘The Tortured 
Poets Department.’ I’m going to go and post the cover right now 
backstage,” she announced.

And so she did.

---------------------
Given the context information and not prior knowledge, answer the query. 
Please be brief, concise, and complete.
If the context information does not contain an answer to the query, 
respond with "No information".
Query: What is the title of Taylor Swift's next album?
Answer:

Mistral-Instruct responded with:

The title of Taylor Swift's next album is 'The Tortured Poets Department'.

This kind of prompting also reduces the likelihood of hallucinations dramatically. For example:

Context information is below.
---------------------
Alpocalypse is the thirteenth studio album by American parody 
musician "Weird Al" Yankovic, released on June 21, 2011.

Track List

1. Perform This Way (Parody of "Born This Way" by Lady Gaga)
2. CNR
3. TMZ (Parody of "You Belong with Me" by Taylor Swift)
4. Skipper Dan
5. Polka Face (Polka medley including songs by artists such as 
   Lady Gaga, Justin Bieber, Britney Spears, Ke$ha, and others)
6. Craigslist
7. Party in the CIA (Parody of "Party in the U.S.A." by Miley 
   Cyrus)
8. Ringtone
9. Another Tattoo (Parody of "Nothin' on You" by B.o.B featuring 
   Bruno Mars)
10. If That Isn't Love
11. Whatever You Like (Parody of "Whatever You Like" by T.I.)
12. Stop Forwarding That Crap to Me

---------------------
Given the context information and not prior knowledge, answer the 
query. Please be brief, concise, and complete.
If the context information does not contain an answer to the query, 
respond with "No information".
Query: What was Weird Al Yankovic's first Taylor Swift parody, what 
year was it released, and what song was he parodying?
Answer:

Weird Al Yankovic's first Taylor Swift parody was "TMZ," and it was 
released in 2011. He was parodying "You Belong with Me" by Taylor 
Swift.

Anatomy of a RAG System

A RAG system has, minimally, two components:

  • A generative AI model.
  • An information retrieval system.

In this article, we will use Mistral-Instruct as the generative AI model and access it via the Hugging Face Inference API. We will use the Qdrant vector database and Jina Embeddings together for the information retrieval system.

As a data source, we will use the National Artificial Intelligence Research and Development Strategic Plan 2023 Update, published in May 2023 by the National Science and Technology Council, a body that advises the US President on scientific and technical matters. This roughly 25,000-word document lays out the policy and strategic planning behind American AI research and development.

It's also a classic “tl;dr” government policy paper. Most of it is very boring, and unless you have a stake or a close interest in this area, you probably aren’t going to read it. But you might still want to know what it says.

This article will walk you through the steps of creating a RAG system able to answer questions about the text and its contents.

Install the Prerequisites

First, make sure you have all the relevant Python libraries installed. In your Python environment, run the following:

pip install llama-index qdrant-client 
pip install pdfminer.six llama-index-vector-stores-qdrant
pip install llama-index-llms-openai llama-index-embeddings-jinaai 
pip install llama-index-llms-huggingface "huggingface_hub[inference]"

We will be using LlamaIndex to construct a RAG system around the Mistral-Instruct large language model, and pdfminer.six to extract the text from the PDF file before storing it in a Qdrant vector store.

Connect to Mistral-Instruct via the Hugging Face Inference API

First, set up the LLM. You will need access to the Hugging Face Inference API, including an access token. If you have a Hugging Face account, you can get one from your account settings page.

If you do not have an account, first set one up, then create an access token.

Place your token in a variable for future use:

hf_inference_api_key = '<your HuggingFace Inference API token>'
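If you would rather not hardcode credentials, you can read the token from an environment variable instead. This is just a sketch; the name HF_TOKEN is our own choice, not something the API requires:

import os

# Assumes you have exported HF_TOKEN in your shell beforehand.
hf_inference_api_key = os.environ["HF_TOKEN"]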

Next, we construct a prompt template:

from llama_index.core import PromptTemplate

qa_prompt_tmpl = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be brief, concise, and complete.\n"
    "If the context information does not contain an answer to the query, "
    "respond with \"No information\".\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)

This template instructs the LLM to use only the context information provided in the prompt to answer questions. This prompt has worked well for us in working with Mistral-Instruct, but you can experiment with other wordings to see what works for your use case.
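You can sanity-check the template by filling it with placeholder values and printing the result:

# Fill the template with dummy values to see the final prompt layout.
print(qa_prompt.format(
    context_str="(retrieved context goes here)",
    query_str="(user question goes here)",
))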

Next, create and initialize a LlamaIndex object that holds the connection to the LLM via the Hugging Face Inference API. The code below comes from the previous post on creating RAG systems using LlamaIndex:

from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token=hf_inference_api_key
)

This puts the connection to the Mistral LLM into the variable mixtral_llm.
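As an optional smoke test, you can send a prompt straight to the model before wiring it into the RAG pipeline:

# Requires network access to the Hugging Face Inference API.
print(mixtral_llm.complete("Briefly, what is retrieval-augmented generation?"))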

Information Retrieval with Qdrant and Jina Embeddings

To set up the retrieval system, you will need a Jina Embeddings API key. You can get one for free with a 1 million-token prepaid budget at the Jina Embeddings website.


Place your Jina Embeddings key in a variable for future use:

jina_emb_api_key = "<your Jina Embeddings API key>"

Then, create a connector object using LlamaIndex for the Jina Embeddings server, selecting specifically the English monolingual model:

from llama_index.embeddings.jinaai import JinaEmbedding

jina_embedding_model = JinaEmbedding(
    api_key=jina_emb_api_key,
    model="jina-embeddings-v2-base-en",
)
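To verify that the key and model name work, you can embed a short test string and check the dimensionality of the resulting vector:

# jina-embeddings-v2-base-en produces 768-dimensional vectors.
vec = jina_embedding_model.get_text_embedding("Hello, world!")
print(len(vec))  # expected: 768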

Load Text Data

Next, we will load the document and split it up into paragraphs. First, download the PDF from the White House website into the variable pdf_data:

import urllib.request

uri = "https://www.whitehouse.gov/wp-content/uploads/2023/05/National-Artificial-Intelligence-Research-and-Development-Strategic-Plan-2023-Update.pdf"
pdf_data = urllib.request.urlopen(uri).read()
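Optionally, you can confirm the download and keep a local copy for later runs (the filename here is our own choice):

import pathlib

pathlib.Path("ai-strategic-plan-2023.pdf").write_bytes(pdf_data)
print(f"Downloaded {len(pdf_data):,} bytes")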

Next, we will process the PDF page by page, extracting the text and then chunking it into paragraphs by splitting on double line breaks. These are stored in the list text_paras:

import regex as re
from io import BytesIO, StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

text_paras = []
parser = PDFParser(BytesIO(pdf_data))
doc = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
for page in PDFPage.create_pages(doc):
    # Extract the plain text of each page into a string buffer.
    output_string = StringIO()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    page_text = output_string.getvalue()
    # Split the page text into paragraphs on blank lines.
    text_paras.extend(re.split(r'\n\s*\n', page_text))
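It's worth eyeballing a few of the extracted chunks to confirm that the paragraph splitting behaved as expected:

# Print the first 200 characters of the first three paragraphs.
for i, para in enumerate(text_paras[:3]):
    print(f"--- paragraph {i} ---")
    print(para[:200])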

Check that everything is loaded:

assert len(text_paras) == 615

Next, we will convert this list of short texts into LlamaIndex Document objects:

from llama_index.core.readers import StringIterableReader

rag_docs = StringIterableReader().load_data(text_paras)

And you can inspect the text:

print(rag_docs[0].text)

Result:

NATIONAL ARTIFICIAL INTELLIGENCE 
RESEARCH AND DEVELOPMENT 
STRATEGIC PLAN 
2023 UPDATE

Set up a Qdrant Vector Database

In this article, we will use the Qdrant Vector Search Cloud to implement the RAG system’s database. Qdrant’s free tier includes 1GB of storage, which is more than enough for this tutorial.

You will need to create an account on the Qdrant Cloud website before continuing.


Once you have an account and are logged in, you will need to create a cluster. Follow the “quick start” instructions on the Qdrant website to set up a free cluster and get an API key and the URL of the Qdrant host server.


Store the key and hostname in variables:

qdrant_api_key = "<your API key>"
qdrant_server = "https://<your server>"

Next, we will need to import the relevant components from the qdrant_client and llama_index packages:

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(qdrant_server, api_key=qdrant_api_key)
vector_store = QdrantVectorStore(client=client, collection_name="NTSC")

This creates a collection named NTSC in your free cluster.
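To confirm that the client can reach your cluster, you can list its collections. Note that, depending on the client version, the collection may only be created once the first documents are indexed into it:

# Lists all collections in the cluster; "NTSC" should appear
# once documents have been indexed.
print(client.get_collections())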

You can read the Qdrant documentation for information on constructing a local data store in memory or on disk, or on hosting your own Qdrant server instance in a Docker container, accessible via a web API.
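For example, a purely local setup for experimentation, with no cloud account needed, can be sketched like this (the path name is our own choice):

# In-memory Qdrant instance: data is lost when the process exits.
local_client = qdrant_client.QdrantClient(":memory:")

# Or persist to a local directory on disk instead:
disk_client = qdrant_client.QdrantClient(path="./qdrant_data")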

Complete the RAG System

Now we will assemble these components into a complete RAG system using boilerplate code for LlamaIndex:

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    get_response_synthesizer,
)

# set up the service and storage contexts
service_context = ServiceContext.from_defaults(
    llm=mixtral_llm, embed_model=jina_embedding_model
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create an index
index = VectorStoreIndex.from_documents(
    rag_docs, storage_context=storage_context, service_context=service_context
)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=service_context,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

This system will use Jina Embeddings to index the paragraphs from our text source and will put the two best matches into the context of every prompt. To change to the three best, for example, just modify the VectorIndexRetriever initialization above to have a different similarity_top_k parameter:

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)
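If you run this code more than once, you don't need to re-embed and re-index the documents each time. As a sketch, assuming the NTSC collection was populated on a previous run, you can rebuild the index directly from the existing vector store:

# Reuses the vector_store and service_context objects defined above
# and skips the embedding step entirely.
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, service_context=service_context
)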

Now the system is ready to use.

Query a Document

Let’s try a straightforward query:

response = query_engine.query("""
What is the Biden Administration's policy with regard to AI? 
""")
print(response.response)

The result (yours may not be identical):

The Biden Administration prioritizes responsible AI systems that 
are ethical, trustworthy, and safe, and serve the public good. 
The Administration has committed to substantial funding for AI 
R&D, and has established the National AI Initiative Office to 
coordinate AI activities across the federal government. 

The Administration's AI policy focuses on developing evaluative
techniques for AI, workforce development, public-private
partnerships, and international collaboration in AI research.

Or something more specific:

response = query_engine.query("""
What protections does the AI Bill of Rights propose to offer?
""")

print(response.response)

The AI Bill of Rights proposes five core protections: Safe and 
Effective Systems; Algorithmic Discrimination Protections; Data 
Privacy; Notice and Explanation; and Human Alternatives, 
Consideration, and Fallback.

Or something very specific:

response = query_engine.query("Who is Kei Koizumi?")
print(response.response)

Kei Koizumi is the Principal Deputy Director for Policy at the Office 
of Science and Technology Policy (OSTP).

You can also ask more fanciful questions:

response = query_engine.query("""
What rights will AIs receive under President Biden's proposed 
AI Bill of Rights?
""")
print(response.response)

The proposed AI Bill of Rights does not grant rights to AI systems, but 
rather outlines principles to ensure that AI technologies are 
developed and used in a manner that respects human rights and values. 
The principles include safety, transparency, fairness, disclosure, 
data privacy, non-discrimination, and the ability for individuals to 
opt out of certain AI systems.

response = query_engine.query("""
Why is President Biden proposing an AI Bill of Rights?
Does AI really need rights?
""")
print(response.response)

President Biden's proposed AI Bill of Rights aims to ensure that AI 
technologies are developed and used ethically, legally, and with 
societal considerations in mind. The Bill of Rights is not granting 
"rights" to AI, but rather setting guidelines for the responsible 
development and deployment of AI systems to protect individuals and 
communities from potential negative impacts.

response = query_engine.query("""
Has Donald Trump weighed in on AI?
Will he Make Humans Great Again?
""")
print(response.response)

No information. The context information does not mention Donald 
Trump's views on AI.

Jina AI and Qdrant

Jina Embeddings’ 8k token context window makes it especially well-suited to retrieval-augmented generation because it can work with much larger blocks of text. Where other RAG applications often have to store individual sentences or pairs of sentences, in this tutorial, we could use whole paragraphs without worrying about their size.

This reduces the number of text items that have to be stored to carry the same information, lowering compute costs, sparing memory, and making retrieval faster.

Qdrant’s vector database is simple to configure, fast, and cost-effective, and as you’ve seen, it takes only a few lines of Python to integrate it into a RAG system.

For more about Qdrant’s AI-ready database and cloud offerings, you can visit their website.


Get in Touch

Jina AI is committed to bringing reliable, affordable AI technologies to enterprises of every size and type. We’d love to hear about your use cases and help fit AI into your business processes. For more information about Jina AI’s offerings and to contact us, check out the Jina AI website or join our community on Discord.
