
Improving Retrieval-Augmented Generation with Agentic RAG: A Step-By-Step Guide for AI Leaders [With Example Code]



Retrieval-Augmented Generation (RAG) is widely used in natural language processing, particularly for integrating external knowledge bases. However, it often struggles with poorly structured queries—leading to inaccuracies or missed information. 

Agentic RAG addresses these issues by refining queries and improving the accuracy of responses. This guide will walk you through how Agentic RAG can elevate your AI systems—complete with a hands-on coding example.

What is RAG?

RAG, or Retrieval-Augmented Generation, is a hybrid approach that combines traditional information retrieval methods with generative language models. Instead of relying solely on a model's internal knowledge, RAG retrieves relevant documents or data from an external knowledge base, using them to generate more accurate and contextually relevant responses.

In a standard RAG setup:

  • Retrieval: The system performs a semantic search to find the most relevant pieces of information from a database or knowledge base.
  • Augmentation: The retrieved, contextually relevant data is added to the language model's prompt, grounding the output in information beyond the model's training data.
  • Generation: The language model uses the augmented prompt to generate the final response.

This method is useful in scenarios where the language model's internal knowledge might be outdated or insufficient. However, the effectiveness of RAG depends heavily on the quality of the retrieval process. If the search fails to retrieve the right information, the final output may be inaccurate or irrelevant.

The Challenges of Traditional RAG

In a typical RAG setup, a user’s query is processed through a semantic search to retrieve relevant data from a knowledge base. If the query isn't well-structured, the system might return irrelevant results or fail to find important information.

How Agentic RAG Addresses This

Agentic RAG offers a more robust solution by actively refining and re-evaluating queries to ensure higher accuracy and relevance. Here's how it works:

Agentic RAG uses AI agents that:

  1. Refine the Query: The AI agent refines the user’s query for better clarity and precision.
  2. Retrieve Information: The refined query is used to search the knowledge base more effectively.
  3. Re-evaluate Results: The AI agent can assess the retrieved information and refine the search further if needed.
  4. Generate the Final Response: Once the agent has the right data, it generates the final answer.

This approach reduces errors and ensures more reliable results.
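
To make the loop concrete, here is a minimal sketch in Python. The refine_query, search, is_sufficient, and generate helpers are hypothetical placeholders (stubbed out so the sketch runs), not a real library API; the rest of this guide builds the working version with real tools.

# Minimal sketch of the agentic RAG loop. The helpers are hypothetical
# placeholders for LLM and vector-store calls, not a real library API.

def refine_query(user_query: str, feedback: str = "") -> str:
    return user_query.strip()  # placeholder: an LLM would rewrite the query here

def search(query: str) -> str:
    return f"documents matching: {query}"  # placeholder: vector-store lookup

def is_sufficient(user_query: str, docs: str) -> bool:
    return bool(docs)  # placeholder: an LLM would judge the evidence here

def generate(user_query: str, docs: str) -> str:
    return f"Answer to {user_query!r}, grounded in: {docs}"  # placeholder LLM call

def agentic_rag(user_query: str, max_iterations: int = 3) -> str:
    query = refine_query(user_query)             # 1. Refine the query
    docs = ""
    for _ in range(max_iterations):
        docs = search(query)                     # 2. Retrieve information
        if is_sufficient(user_query, docs):      # 3. Re-evaluate the results
            break
        query = refine_query(user_query, feedback=docs)  # refine and retry
    return generate(user_query, docs)            # 4. Generate the final response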

Step-By-Step Guide

Now that you know the theory around Agentic RAG, let’s get practical and build an example. Follow my steps below to learn how to set it up. 

Setting Up Your Environment

Before diving into the code, make sure your virtual environment is set up. If not, follow this setup guide.

We'll use Hugging Face's transformers library, though you can explore other alternatives as needed.

Step 1: Install the Required Packages

First, install the necessary libraries:

pip install langchain langchain-openai langchain-community langchain-chroma langchain-huggingface huggingface-hub python-dotenv sentence-transformers "transformers[agents]"

Now you should be able to see these packages inside your venv folder.

Step 2: Set Up Imports

Create a main.py file and add the following imports:

import os
import datasets
from dotenv import load_dotenv
from transformers import AutoTokenizer
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from tqdm import tqdm
from transformers.agents import ReactJsonAgent
from langchain_openai import ChatOpenAI
import logging
from RetrieverTool import RetrieverTool
from OpenAIEngine import OpenAIEngine

# Entry point: runs when the file is executed directly.
# Keep this at the bottom of main.py so that main(), defined in the later steps, exists when it is called.
if __name__ == "__main__":
    main()


Step 3: Load the Dataset and Prepare the RAG Components

To start, load your dataset. As an example, I'll use a mental health dataset from Hugging Face. You can find it here.

# Load the knowledge base
knowledge_base = datasets.load_dataset("TVRRaviteja/Mental-Health-Data", split="train")

# Convert the dataset into an array of LangChain Document objects
source_docs = [
    Document(page_content=doc["text"])
    for doc in knowledge_base
]

Next, set up a tokenizer and define a text splitter:

# Initialize the text splitter
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
   tokenizer,
   chunk_size=200,
   chunk_overlap=20,
   add_start_index=True,
   strip_whitespace=True,
   separators=["\n\n", "\n", ".", " ", ""],
)


Configure the Tokenizer and Text Splitter

Before generating embeddings, you need to set up a tokenizer and text splitter to process the dataset effectively.

1. Load the Pre-trained Tokenizer:

We use a pre-trained tokenizer from the Hugging Face model hub. The thenlper/gte-small model is used to convert text into tokens that the model can understand.

2. Set Up the Recursive Character Text Splitter:

The text splitter divides large texts into smaller chunks for processing by the model. Here's how the arguments customize the splitting behavior:

  • tokenizer: Measures the size of the text chunks.
  • chunk_size=200: Sets each chunk to 200 tokens.
  • chunk_overlap=20: Maintains a 20-token overlap between chunks to preserve context.
  • add_start_index=True: Adds the start index of each chunk, aligning them with their positions in the source text.
  • strip_whitespace=True: Removes leading or trailing whitespace from the chunks.
  • separators=["\n\n", "\n", ".", " ", ""]: Specifies the hierarchy of separators, from the most significant (a double newline \n\n) down to the least significant (a single space and, finally, the empty string).
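
With the splitter configured, run it over the source documents to produce the docs_processed chunks that the next steps will index. Printing one chunk is a quick, optional way to confirm the settings behave as expected:

# Split the source documents into chunks for indexing
docs_processed = text_splitter.split_documents(source_docs)
print(f"Split {len(source_docs)} documents into {len(docs_processed)} chunks")
print(docs_processed[0].page_content[:200])  # preview the first chunk
print(docs_processed[0].metadata)  # includes 'start_index' because add_start_index=True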

Final Steps: Implementing and Running Agentic RAG

Now that you've prepared the tokenizer and text splitter, it's time to generate embeddings and store them in a vector database (such as Chroma). Here's how to proceed:

Initialize the Embedding Model

You can use a Hugging Face embedding model to create vector embeddings for your documents.

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")

Create the Vector Database

Store the generated embeddings in a vector database, such as Chroma. The store indexes the docs_processed chunks produced by the text splitter above.

# Create the vector database
vectordb = Chroma.from_documents(
   documents=docs_processed,
   embedding=embedding_model,
   persist_directory="chroma"
)
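
Before adding any agent machinery, it's worth confirming that retrieval works. A quick illustrative query against the store:

# Illustrative sanity check of the vector store
for i, doc in enumerate(vectordb.similarity_search("coping with procrastination", k=3)):
    print(f"===== Document {i} =====")
    print(doc.page_content[:150])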

Set Up the Retriever Tool:

Create a generic retriever tool that works with any LangChain vector store. This tool will handle document retrieval based on semantic similarity.

from transformers.agents import Tool

class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "text",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "text"

    def __init__(self, vectordb, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )
        return "\nRetrieved documents:\n" + "".join(
            [f"===== Document {str(i)} =====\n" + doc.page_content for i, doc in enumerate(docs)]
        )

1. Attributes:

  • inputs: Specifies the expected input for the tool, which is a query of type text. The description suggests the query should be in an affirmative form rather than a question.
  • output_type = "text": Specifies that the output of the tool will be text.

2. __init__ Method:

The constructor initializes the RetrieverTool with a vectordb (a vector database used for similarity search) and any additional keyword arguments (kwargs).

3. forward Method:

  • Purpose: Implements the logic for performing the document retrieval.
  • query: The method accepts a query string as input.
  • assert isinstance(query, str): Ensures the query is a string.
  • self.vectordb.similarity_search(query, k=7): Performs a similarity search on the vector database using the query, retrieving the top 7 most similar documents.
  • return "\nRetrieved documents:\n" + "".join(...): Formats the retrieved documents into a string and returns it. The output includes headers like "===== Document 0 =====" for each document, followed by the document's content.
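
Before wiring the tool into an agent, you can sanity-check it on its own. An illustrative standalone call, using the vectordb built earlier:

# Illustrative standalone check of the retriever tool
retriever_tool = RetrieverTool(vectordb)
print(retriever_tool.forward("strategies to manage imposter syndrome"))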

Develop the OpenAI Engine

Set up an engine that uses OpenAI for retrieval and LLM operations. This engine will handle the interaction with the OpenAI API to generate responses.

import os
from openai import OpenAI
from dotenv import load_dotenv
from transformers.agents.llm_engine import MessageRole, get_clean_message_list

load_dotenv()

openai_role_conversions = {
    MessageRole.TOOL_RESPONSE: MessageRole.USER,
}

class OpenAIEngine:
    def __init__(self, model_name="gpt-4-turbo"):
        self.model_name = model_name
        self.client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY"),
        )

    def __call__(self, messages, stop_sequences=[]):
        messages = get_clean_message_list(messages, role_conversions=openai_role_conversions)

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            stop=stop_sequences,
            temperature=0.5,
        )
        return response.choices[0].message.content
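
As a quick illustrative check, you can call the engine directly with a chat-style message list before handing it to the agent:

# Illustrative direct call to the engine, outside the agent loop
engine = OpenAIEngine()
print(engine([{"role": "user", "content": "Say hello in one short sentence."}]))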


Create the Agent

Combine the retriever tool and OpenAI engine into an agent using ReactJsonAgent. This agent will process queries, retrieve relevant information, and generate responses.

retriever_tool = RetrieverTool(vectordb)

llm_engine = OpenAIEngine()
# Create the agent
agent = ReactJsonAgent(tools=[retriever_tool], llm_engine=llm_engine, max_iterations=3, verbose=2)


Run the Agentic Code

def run_agentic_rag(question: str) -> str:
   enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
   give a comprehensive answer to the question below.
   Respond only to the question asked, response should be concise and relevant to the question.
   If you cannot find information, do not give up and try calling your retriever again with different arguments!
   Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
   Your queries should not be questions but affirmative form sentences: e.g. rather than "How to check personality scores of someone who is open and agreeable?", query should be "find me personality scores of someone who is open and agreeable".

   Question:
   {question}"""

   return agent.run(enhanced_question)


Compare with Standard RAG

Implement a standard RAG method to compare with the Agentic RAG approach. This will help you understand the improvements and benefits of Agentic RAG.

def run_standard_rag(question: str) -> str:
   # Standard RAG: one similarity search on the raw question, no query refinement
   docs = vectordb.similarity_search(question, k=7)
   context = "".join(
       [f"===== Document {str(i)} =====\n" + doc.page_content for i, doc in enumerate(docs)]
   )

   prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.
   Respond only to the question asked, response should be concise and relevant to the question.
   Provide the number of the source document when relevant.

   Question:
   {question}

   Supporting documents:
   {context}
   """
   messages = [{"role": "user", "content": prompt}]

   reader_llm = ChatOpenAI(model="gpt-4-turbo", api_key=os.getenv("OPENAI_API_KEY"))
   ai_msg = reader_llm.invoke(messages)

   return ai_msg.content

Now, wire these functions together in the main() function that the entry-point guard from Step 2 calls.

def main():
   question = """
   How can I check my score? If I am procrastinating and at the same time I have imposter syndrome.
   """
   print(f"Question: {question}")

   agentic_answer = run_agentic_rag(question)
   print("Agentic RAG Answer:")
   print(f"Answer: {agentic_answer}")

   standard_answer = run_standard_rag(question)
   print("\nStandard RAG Answer:")
   print(f"Answer: {standard_answer}")


Run the Script

Open your terminal and run the script using the following command:

python main.py

Compare the Outputs

After running the script, compare the outputs from Agentic RAG and standard RAG. Analyze how Agentic RAG refines the query and delivers more accurate and relevant results.

Query: How can I check my score? If I am procrastinating and at the same time I have imposter syndrome.

Standard RAG Answer:

To check your score, you can refer to Documents 4, 5, and 6, which contain information on Personality Scores. These documents will provide the specific details you need regarding your scores. If procrastination or imposter syndrome is affecting your ability to check your scores, consider setting a specific, manageable goal to review your scores at a designated time, thereby reducing the tendency to delay due to these feelings.

Agentic RAG Answer:

To check your "score" or assess yourself regarding procrastination and imposter syndrome, you can use specific psychological self-assessment tools. Unfortunately, specific tools or questionnaires were not identified in the retrieved documents. However, generally, you can consider the following approaches:

1. Procrastination Assessment:
For procrastination, tools like the "Procrastination Assessment Scale for Students (PASS)" or the "Procrastination Inventory" can be used to measure levels of procrastination in various contexts.

2. Imposter Syndrome Assessment:
For imposter syndrome, the "Clance Imposter Phenomenon Scale" is a commonly used tool to assess feelings of fraudulence and self-doubt in personal achievement contexts.

These tools typically consist of a series of statements where you rate your agreement or frequency, resulting in a score that indicates your level of procrastination or imposter feelings. You can find these assessments online or through psychological services. They offer a structured way to reflect on your behaviors and feelings, providing a "score" that reflects the severity or frequency of the traits in question.
Note: You can improve performance further with a higher-dimensional embedding model and a larger dataset with better data quality.

Securing Data in RAG and Agentic RAG Deployments

When implementing RAG, especially in enterprise or sensitive environments, securing the data becomes paramount. Here's how to ensure that your RAG implementation is both effective and secure:

  1. Ensure proper authorization: Verify that users only access the data they are authorized to see. This means embedding a robust authorization layer within your RAG system that cross-checks user permissions before any data is retrieved or generated.
  2. Embed permission layers on data: During the retrieval process, embed permissions on the data itself. This ensures that when a similarity search is performed, the system retrieves and generates information based only on what the user is authorized to access. This is particularly critical when dealing with sensitive or regulated data, as it helps prevent unauthorized information disclosure. A sketch of this pattern follows the list below.
  3. Importance in Agentic RAG: By leveraging agents, we can refine queries to mitigate potential security risks, such as the exposure of Personally Identifiable Information (PII). This approach enables the implementation of an additional security layer before the data access layer. Agents can also dynamically adapt to evolving security requirements, continuously monitoring and adjusting queries to prevent unauthorized data access. This ensures that sensitive information is consistently protected, even as threats and data patterns change.
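
One way to implement the permission layer is to filter retrieval by metadata. The sketch below assumes each chunk was ingested with an access_group metadata field (illustrative; the demo dataset in this guide carries no permissions) and relies on Chroma's metadata filtering as exposed through LangChain:

# Hedged sketch: permission-aware retrieval via metadata filtering.
# Assumes chunks were ingested with an 'access_group' metadata field
# (illustrative; the demo dataset above does not carry permissions).
def retrieve_for_user(vectordb, query: str, user_groups: list[str], k: int = 7):
    # The metadata filter restricts the similarity search to documents
    # the user is allowed to see, before anything reaches the LLM.
    return vectordb.similarity_search(
        query,
        k=k,
        filter={"access_group": {"$in": user_groups}},
    )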

Conclusion

Agentic RAG tackles the limitations of traditional RAG by refining queries and improving the accuracy of responses. By following this guide, you can implement Agentic RAG in your AI systems, ensuring more precise and reliable outputs.

We would love to talk to you about implementing an Agentic RAG infrastructure—while ensuring security and compliance are covered from the get-go. 

Please reach out to schedule a time with us, or connect with me on LinkedIn and explore the full code on GitHub. 

About the Author

Ameer Hamza is a highly talented Software Engineer at Opsin, specializing in building the next-generation security orchestration layer for GenAI applications. With a strong background in large language models (LLMs) and GenAI, Ameer plays a crucial role in developing innovative solutions that ensure secure and compliant AI deployments. His expertise extends beyond coding; he is instrumental in shaping the architecture and design of Opsin's cutting-edge products. Ameer is passionate about leveraging advanced AI technologies to create robust security frameworks, making him a key contributor to Opsin's mission of enabling secure and compliant GenAI by design.
