Harshad Suryawanshi • Feb 2, 2024
RAGArch: Building a No-Code RAG Pipeline Configuration & One-Click RAG Code Generation Tool Powered by LlamaIndex
Unlocking the power of AI should be as intuitive as using your favorite apps. That’s the philosophy behind RAGArch, my latest creation designed to demystify and streamline the process of setting up Retrieval-Augmented Generation (RAG) pipelines. This tool is born from a simple vision: to provide a straightforward, no-code platform that empowers both seasoned developers and curious explorers in the world of AI to craft, test, and implement RAG pipelines with confidence and ease.
Features
RAGArch leverages LlamaIndex’s powerful LLM orchestration capabilities to provide a seamless experience and granular control over your RAG pipeline.
- Intuitive Interface: RAGArch’s user-friendly interface, built with Streamlit, allows you to test different RAG pipeline components interactively.
- Custom Configuration: The app provides a wide range of options to configure Language Models, Embedding Models, Node Parsers, Response Synthesis Methods, and Vector Stores to suit your project’s needs.
- Live Testing: Instantly test your RAG pipeline with your own data and see how different configurations affect the outcome.
- One-Click Code Generation: Once you’re satisfied with the configuration, the app can generate the Python code for your custom RAG pipeline, ready to be integrated into your application.
Tools and Technologies
The creation of RAGArch was made possible by integrating a variety of powerful tools and technologies:
- UI: Streamlit
- Hosting: Hugging Face Spaces
- LLMs: OpenAI GPT-3.5 and GPT-4, Cohere API, Gemini Pro
- LLM Orchestration: LlamaIndex
- Embedding Models: “BAAI/bge-small-en-v1.5”, “WhereIsAI/UAE-Large-V1”, “BAAI/bge-large-en-v1.5”, “khoa-klaytn/bge-small-en-v1.5-angle”, “BAAI/bge-base-en-v1.5”, “llmrails/ember-v1”, “jamesgpt1/sf_model_e5”, “thenlper/gte-large”, “infgrad/stella-base-en-v2” and “thenlper/gte-base”
- Vector Stores: Simple (LlamaIndex default), Pinecone and Qdrant
Deep Dive into the Code
The app.py script is the backbone of RAGArch, integrating various components to provide a cohesive experience. The following are its key functions:
upload_file
This function manages file uploads and uses LlamaIndex's SimpleDirectoryReader to load documents into the system. It supports a wide array of document types, including PDFs, text files, HTML, JSON files, and more, making it versatile for processing diverse data sources.
def upload_file():
    file = st.file_uploader("Upload a file", on_change=reset_pipeline_generated)
    if file is not None:
        file_path = save_uploaded_file(file)
        if file_path:
            loaded_file = SimpleDirectoryReader(input_files=[file_path]).load_data()
            print(f"Total documents: {len(loaded_file)}")
            st.success(f"File uploaded successfully. Total documents loaded: {len(loaded_file)}")
            return loaded_file
    return None
save_uploaded_file
This utility function saves the uploaded file to a temporary location on the server, making it accessible for further processing. It’s a crucial part of the file handling process, ensuring data integrity and availability.
def save_uploaded_file(uploaded_file):
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=os.path.splitext(uploaded_file.name)[1]) as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            return tmp_file.name
    except Exception as e:
        st.error(f"Error saving file: {e}")
        return None
select_llm
Allows users to select a Large Language Model and initializes it for use. You can choose from Google’s Gemini Pro, Cohere, and OpenAI’s GPT-3.5 and GPT-4.
def select_llm():
    st.header("Choose LLM")
    llm_choice = st.selectbox("Select LLM", ["Gemini", "Cohere", "GPT-3.5", "GPT-4"], on_change=reset_pipeline_generated)
    if llm_choice == "GPT-3.5":
        llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-1106")
        st.write(f"{llm_choice} selected")
    elif llm_choice == "GPT-4":
        llm = OpenAI(temperature=0.1, model="gpt-4-1106-preview")
        st.write(f"{llm_choice} selected")
    elif llm_choice == "Gemini":
        llm = Gemini(model="models/gemini-pro")
        st.write(f"{llm_choice} selected")
    elif llm_choice == "Cohere":
        llm = Cohere(model="command", api_key=os.environ['COHERE_API_TOKEN'])
        st.write(f"{llm_choice} selected")
    return llm, llm_choice
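Note that each LLM back-end expects its API key to be available before the app runs. Only COHERE_API_TOKEN is read explicitly above; the OpenAI and Gemini clients conventionally pick up OPENAI_API_KEY and GOOGLE_API_KEY from the environment, so those two variable names are an assumption on my part rather than something defined in app.py:

import os

# Assumed environment variables; only COHERE_API_TOKEN appears explicitly in app.py.
os.environ["OPENAI_API_KEY"] = "<your-openai-key>"    # GPT-3.5 / GPT-4
os.environ["GOOGLE_API_KEY"] = "<your-google-key>"    # Gemini Pro
os.environ["COHERE_API_TOKEN"] = "<your-cohere-key>"  # Cohere Command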
select_embedding_model
Offers a dropdown for users to select the embedding model of their choice from a predefined list. I have included some of the top embedding models from Hugging Face’s MTEB leaderboard. Near the dropdown I have also included a handy link to the leaderboard where users can get more information about the embedding models.
def select_embedding_model():
    st.header("Choose Embedding Model")
    col1, col2 = st.columns([2, 1])
    with col2:
        st.markdown("[Embedding Models Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)")
    model_names = [
        "BAAI/bge-small-en-v1.5",
        "WhereIsAI/UAE-Large-V1",
        "BAAI/bge-large-en-v1.5",
        "khoa-klaytn/bge-small-en-v1.5-angle",
        "BAAI/bge-base-en-v1.5",
        "llmrails/ember-v1",
        "jamesgpt1/sf_model_e5",
        "thenlper/gte-large",
        "infgrad/stella-base-en-v2",
        "thenlper/gte-base"
    ]
    selected_model = st.selectbox("Select Embedding Model", model_names, on_change=reset_pipeline_generated)
    with st.spinner("Please wait") as status:
        embed_model = HuggingFaceEmbedding(model_name=selected_model)
        st.session_state['embed_model'] = embed_model
        st.markdown(f"Embedding Model: {embed_model.model_name}")
        st.markdown(f"Embed Batch Size: {embed_model.embed_batch_size}")
        st.markdown(f"Max Length: {embed_model.max_length}")
    return embed_model, selected_model
select_node_parser
This function allows users to choose a node parser, which is instrumental in breaking down documents into manageable chunks or nodes, facilitating better handling and processing. I have included some of the most commonly used node parsers supported by LlamaIndex: SentenceSplitter, CodeSplitter, SemanticSplitterNodeParser, TokenTextSplitter, HTMLNodeParser, JSONNodeParser and MarkdownNodeParser.
def select_node_parser():
    st.header("Choose Node Parser")
    col1, col2 = st.columns([4, 1])
    with col2:
        st.markdown("[More Information](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/root.html)")
    parser_types = ["SentenceSplitter", "CodeSplitter", "SemanticSplitterNodeParser",
                    "TokenTextSplitter", "HTMLNodeParser", "JSONNodeParser", "MarkdownNodeParser"]
    parser_type = st.selectbox("Select Node Parser", parser_types, on_change=reset_pipeline_generated)
    parser_params = {}
    if parser_type == "HTMLNodeParser":
        tags = st.text_input("Enter tags separated by commas", "p, h1")
        tag_list = [tag.strip() for tag in tags.split(',')]  # strip whitespace so "p, h1" yields valid tag names
        parser = HTMLNodeParser(tags=tag_list)
        parser_params = {'tags': tag_list}
    elif parser_type == "JSONNodeParser":
        parser = JSONNodeParser()
    elif parser_type == "MarkdownNodeParser":
        parser = MarkdownNodeParser()
    elif parser_type == "CodeSplitter":
        language = st.text_input("Language", "python")
        chunk_lines = st.number_input("Chunk Lines", min_value=1, value=40)
        chunk_lines_overlap = st.number_input("Chunk Lines Overlap", min_value=0, value=15)
        max_chars = st.number_input("Max Chars", min_value=1, value=1500)
        parser = CodeSplitter(language=language, chunk_lines=chunk_lines, chunk_lines_overlap=chunk_lines_overlap, max_chars=max_chars)
        parser_params = {'language': language, 'chunk_lines': chunk_lines, 'chunk_lines_overlap': chunk_lines_overlap, 'max_chars': max_chars}
    elif parser_type == "SentenceSplitter":
        chunk_size = st.number_input("Chunk Size", min_value=1, value=1024)
        chunk_overlap = st.number_input("Chunk Overlap", min_value=0, value=20)
        parser = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        parser_params = {'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}
    elif parser_type == "SemanticSplitterNodeParser":
        if 'embed_model' not in st.session_state:
            st.warning("Please select an embedding model first.")
            return None, None
        embed_model = st.session_state['embed_model']
        buffer_size = st.number_input("Buffer Size", min_value=1, value=1)
        breakpoint_percentile_threshold = st.number_input("Breakpoint Percentile Threshold", min_value=0, max_value=100, value=95)
        parser = SemanticSplitterNodeParser(buffer_size=buffer_size, breakpoint_percentile_threshold=breakpoint_percentile_threshold, embed_model=embed_model)
        parser_params = {'buffer_size': buffer_size, 'breakpoint_percentile_threshold': breakpoint_percentile_threshold}
    elif parser_type == "TokenTextSplitter":
        chunk_size = st.number_input("Chunk Size", min_value=1, value=1024)
        chunk_overlap = st.number_input("Chunk Overlap", min_value=0, value=20)
        parser = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        parser_params = {'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}

    # Save the parser type and parameters to the session state
    st.session_state['node_parser_type'] = parser_type
    st.session_state['node_parser_params'] = parser_params
    return parser, parser_type
Below the node parser selection, I have also included a preview of the first node of the text after splitting/parsing, to give users an idea of how the chunking actually happens based on the selected node parser and its parameters.
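The preview logic itself is not shown in this walkthrough; here is a minimal sketch of how it could be done, assuming documents holds the list returned by upload_file and parser is the object returned by select_node_parser:

# Hypothetical preview snippet; get_nodes_from_documents is the standard
# LlamaIndex method for splitting loaded documents into nodes.
nodes = parser.get_nodes_from_documents(documents)
if nodes:
    st.subheader("First Node Preview")
    st.text(nodes[0].text)  # display the text of the first chunk
else:
    st.info("No nodes were produced from the uploaded document.")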
select_response_synthesis_method
This function allows users to choose how the RAG pipeline synthesizes responses. I have included the various response synthesis methods supported by LlamaIndex, including refine, tree_summarize, compact, simple_summarize, accumulate and compact_accumulate.
Users can click the “More Information” link for more details about response synthesis and the different modes.
def select_response_synthesis_method():
    st.header("Choose Response Synthesis Method")
    col1, col2 = st.columns([4, 1])
    with col2:
        st.markdown("[More Information](https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/response_synthesizers.html)")
    response_modes = [
        "refine",
        "tree_summarize",
        "compact",
        "simple_summarize",
        "accumulate",
        "compact_accumulate"
    ]
    selected_mode = st.selectbox("Select Response Mode", response_modes, on_change=reset_pipeline_generated)
    response_mode = selected_mode
    return response_mode, selected_mode
select_vector_store
Enables users to choose a vector store, which is a critical component for storing and retrieving embeddings in the RAG pipeline. This function supports selection from multiple vector store options: Simple (the LlamaIndex default), Pinecone and Qdrant.
def select_vector_store():
    st.header("Choose Vector Store")
    vector_stores = ["Simple", "Pinecone", "Qdrant"]
    selected_store = st.selectbox("Select Vector Store", vector_stores, on_change=reset_pipeline_generated)
    vector_store = None
    if selected_store == "Pinecone":
        pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])
        index = pc.Index("test")
        vector_store = PineconeVectorStore(pinecone_index=index)
    elif selected_store == "Qdrant":
        client = qdrant_client.QdrantClient(location=":memory:")
        vector_store = QdrantVectorStore(client=client, collection_name="sampledata")
    st.write(selected_store)
    return vector_store, selected_store
generate_rag_pipeline
This core function ties together the selected components to generate a RAG pipeline. It initializes the pipeline with the chosen LLM, embedding model, node parser, response synthesis method, and vector store. It is triggered by pressing the ‘Generate RAG Pipeline’ button.
def generate_rag_pipeline(file, llm, embed_model, node_parser, response_mode, vector_store):
    if vector_store is not None:
        # Set storage context if vector_store is not None
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
    else:
        storage_context = None

    # Create the service context
    service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, node_parser=node_parser)

    # Create the vector index
    vector_index = VectorStoreIndex.from_documents(documents=file, storage_context=storage_context, service_context=service_context, show_progress=True)
    if storage_context:
        vector_index.storage_context.persist(persist_dir="persist_dir")

    # Create the query engine
    query_engine = vector_index.as_query_engine(
        response_mode=response_mode,
        verbose=True,
    )
    return query_engine
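Once the query engine is returned, querying the indexed document takes a single call. A minimal usage sketch, assuming query_engine came from generate_rag_pipeline (in the app itself, the question presumably comes from a Streamlit text input rather than a hard-coded string):

# Hypothetical usage of the returned query engine
response = query_engine.query("What is this document about?")
print(response)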
generate_code_snippet
This function is the culmination of the user’s selections, generating the Python code necessary to implement the configured RAG pipeline. It dynamically constructs the code snippet based on the chosen LLM, embedding model, node parser, response synthesis method, and vector store, including the parameters set for the node parser.
def generate_code_snippet(llm_choice, embed_model_choice, node_parser_choice, response_mode, vector_store_choice):
    node_parser_params = st.session_state.get('node_parser_params', {})
    print(node_parser_params)

    code_snippet = "import os\n"
    code_snippet += "from pinecone import Pinecone\n"
    code_snippet += "from llama_index.llms import OpenAI, Gemini, Cohere\n"
    code_snippet += "from llama_index.embeddings import HuggingFaceEmbedding\n"
    code_snippet += "from llama_index import ServiceContext, VectorStoreIndex, StorageContext, SimpleDirectoryReader\n"
    code_snippet += "from llama_index.node_parser import SentenceSplitter, CodeSplitter, SemanticSplitterNodeParser, TokenTextSplitter\n"
    code_snippet += "from llama_index.node_parser.file import HTMLNodeParser, JSONNodeParser, MarkdownNodeParser\n"
    code_snippet += "from llama_index.vector_stores import PineconeVectorStore, QdrantVectorStore\n"
    code_snippet += "import qdrant_client\n\n"

    # LLM initialization
    if llm_choice == "GPT-3.5":
        code_snippet += "llm = OpenAI(temperature=0.1, model='gpt-3.5-turbo-1106')\n"
    elif llm_choice == "GPT-4":
        code_snippet += "llm = OpenAI(temperature=0.1, model='gpt-4-1106-preview')\n"
    elif llm_choice == "Gemini":
        code_snippet += "llm = Gemini(model='models/gemini-pro')\n"
    elif llm_choice == "Cohere":
        code_snippet += "llm = Cohere(model='command', api_key='<YOUR_API_KEY>')  # Replace <YOUR_API_KEY> with your actual API key\n"

    # Embedding model initialization
    code_snippet += f"embed_model = HuggingFaceEmbedding(model_name='{embed_model_choice}')\n\n"

    # Node parser initialization
    node_parsers = {
        "SentenceSplitter": f"SentenceSplitter(chunk_size={node_parser_params.get('chunk_size', 1024)}, chunk_overlap={node_parser_params.get('chunk_overlap', 20)})",
        "CodeSplitter": f"CodeSplitter(language='{node_parser_params.get('language', 'python')}', chunk_lines={node_parser_params.get('chunk_lines', 40)}, chunk_lines_overlap={node_parser_params.get('chunk_lines_overlap', 15)}, max_chars={node_parser_params.get('max_chars', 1500)})",
        "SemanticSplitterNodeParser": f"SemanticSplitterNodeParser(buffer_size={node_parser_params.get('buffer_size', 1)}, breakpoint_percentile_threshold={node_parser_params.get('breakpoint_percentile_threshold', 95)}, embed_model=embed_model)",
        "TokenTextSplitter": f"TokenTextSplitter(chunk_size={node_parser_params.get('chunk_size', 1024)}, chunk_overlap={node_parser_params.get('chunk_overlap', 20)})",
        "HTMLNodeParser": f"HTMLNodeParser(tags={node_parser_params.get('tags', ['p', 'h1'])})",
        "JSONNodeParser": "JSONNodeParser()",
        "MarkdownNodeParser": "MarkdownNodeParser()"
    }
    code_snippet += f"node_parser = {node_parsers[node_parser_choice]}\n\n"

    # Response mode
    code_snippet += f"response_mode = '{response_mode}'\n\n"

    # Vector store initialization
    if vector_store_choice == "Pinecone":
        code_snippet += "pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])\n"
        code_snippet += "index = pc.Index('test')\n"
        code_snippet += "vector_store = PineconeVectorStore(pinecone_index=index)\n"
    elif vector_store_choice == "Qdrant":
        code_snippet += "client = qdrant_client.QdrantClient(location=':memory:')\n"
        code_snippet += "vector_store = QdrantVectorStore(client=client, collection_name='sampledata')\n"
    elif vector_store_choice == "Simple":
        code_snippet += "vector_store = None  # Simple in-memory vector store selected\n"

    code_snippet += "\n# Finalizing the RAG pipeline setup\n"
    code_snippet += "if vector_store is not None:\n"
    code_snippet += "    storage_context = StorageContext.from_defaults(vector_store=vector_store)\n"
    code_snippet += "else:\n"
    code_snippet += "    storage_context = None\n\n"
    code_snippet += "service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, node_parser=node_parser)\n\n"
    code_snippet += "documents = SimpleDirectoryReader(input_files=['path_to_your_file']).load_data()  # Replace with the path to your file\n"
    code_snippet += "vector_index = VectorStoreIndex.from_documents(documents=documents, storage_context=storage_context, service_context=service_context, show_progress=True)\n"
    code_snippet += "if storage_context:\n"
    code_snippet += "    vector_index.storage_context.persist(persist_dir='persist_dir')\n\n"
    code_snippet += "query_engine = vector_index.as_query_engine(response_mode=response_mode, verbose=True)\n"

    return code_snippet
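In the app’s main flow (not covered in this walkthrough), the returned string can be rendered with Streamlit’s built-in code display. A minimal sketch of that wiring, with a button label and variable names that are illustrative rather than taken from app.py:

# Hypothetical wiring of the one-click code generation
if st.button("Generate Code Snippet"):
    code = generate_code_snippet(llm_choice, selected_model, parser_type, response_mode, selected_store)
    st.code(code, language="python")  # renders the generated pipeline code with syntax highlighting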
Conclusion
RAGArch stands at the intersection of innovation and practicality, offering a streamlined no-code approach to RAG pipeline development. It’s designed to demystify the complexities of AI configurations. With RAGArch, both seasoned developers and AI enthusiasts can craft custom pipelines with ease, accelerating the journey from idea to implementation.
Your insights and contributions are invaluable as I continue to evolve this tool. Check out RAGArch on GitHub and let’s start a conversation on LinkedIn. I’m always eager to collaborate and share knowledge with fellow tech adventurers.