top of page

No Clouds Allowed: Building an All Open Source Local RAG System

In today’s AI landscape, companies like Microsoft and Google offer sophisticated Retriever-Augmented Generation (RAG) solutions through platforms like Azure and GCP, simplifying the creation of AI applications with managed services. These services, such as Azure's AI search, boast powerful capabilities and manage vast quantities of documents with ease. However, such managed services can be costly, may lack customisation, and impose limitations like rate limits and model access. What if you could bypass these constraints? With an increasing number of powerful open-source models and accessible tools, setting up a local RAG system is not only more achievable but also allows for complete customisation and independence. This blog post will guide you through building your own local RAG system using tools like Ollama, Llama3, and Langchain, empowering you to create an AI solution that is 100% independent, right on your own machine.

1. Core Components of Our RAG System

A Retriever-Augmented Generation (RAG) system is an advanced AI tool designed to answer questions and generate text using your own collected data. Here's how our project breaks down into three key components:

  1. Data: From raw input to ready-to-use, this component preps and stores your info for quick access.

  2. Retriever: Think of it as a smart search engine that finds the best bits of data to answer your queries.

  3. Chat Model: This is the clever conversationalist that takes the info from the Retriever and chats back with answers that make sense.

2. Setting up

Before we dive into the coding part of our project, there's some essential prep work to do. First, download and install oLlama on your machine to enable local use of Llama 3. Next, obtain the dataset we'll be using to feed into our RAG system. Finally, set up a dedicated project directory on your machine to keep everything organized and accessible.

Llama 3

  1. Start by visiting the oLlama download page here and pick the right version based on your operating system.

  2. After downloading and installing oLlama, open your terminal and download Llama 3 by running this command.

> ollama pull llama3


For this project, we’re diving into the world of Langchain documentation! I've curated a set of files for us by carefully selecting and scraping the documentation pages from their website. After cleaning out the clutter and trimming down to just the essentials, I've transformed them into simple markdown files. These files will serve as the knowledge base for our AI, teaching it to understand and generate responses based on the Langchain docs.

Download ZIP • 364KB

Project Structure

Now, let's setup our AI project! Whether you're using VS Code, PyCharm, or any other IDE that you prefer, you can arrange your project space in a way that suits you best. However, if you'd like to mirror my setup to follow this guide as closely as possible, here’s how I’ve organised my workspace:

  • src 📁: This folder houses all our files.

  • data 📁: Here’s where our previously mentioned Langchain dataset resides.

  • chroma 📁: This directory is where we will be saving files regarding our vector store.

  • LangchainRAG.ipynb 📄: This is the python notebook where we will be writing our code.

3. The Code: Bringing Our AI to Life

Now that our setup is complete, it's time to roll up our sleeves and dive into the code! First things first, let's get all the necessary dependencies installed. This will ensure our project has everything it needs to run smoothly and efficiently.

%pip install -qU langchain langchain-core langchain_community pandas

4. Data

Loading our Markdown files

I have defined a simple function to load our markdown files. This function accepts a directory as input and outputs two lists: one with the markdown files and another with their respective names.

import os

def load_markdown_files(directory_path):
    Loads all markdown files from the specified directory.

        directory_path (str): Path to the directory containing markdown files.

        tuple: A tuple containing two lists:
            - List of file contents (each position corresponds to a file).
            - List of file names (each position corresponds to a file name).
    file_contents = []  # List to store file contents
    file_names = []     # List to store file names

    # Iterate through files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith(".md"):  # Check if it's a markdown file
            file_path = os.path.join(directory_path, filename)
            with open(file_path, "r", encoding="utf-8") as file:
                content =

    return file_contents, file_names

After defining it, we simply need to specify where our data is stored and then call the function.

markdown_dir = 'data'
markdown_files, markdown_names = load_markdown_files('data')

Pass files into LangchainDocument list

Now that we have our markdown files and names neatly loaded into lists, the next step is to transform this raw data into something more structured and powerful. We'll convert our files into a list of LangchainDocuments.

from langchain.docstore.document import Document as LangchainDocument

langchain_docs = [
    LangchainDocument(page_content=doc, metadata={"source": markdown_names[i]})
    for i,doc in enumerate(markdown_files)
print(f'The number of full sized files is {len(langchain_docs)}')

This code snippet creates a LangchainDocument for each markdown file, tagging each with its source for easy reference. It’s a simple but crucial step that sets us up for what we need to do next.

Chunking our LangchainDocuments

Now that each file has been transformed into a LangchainDocument, it's time to tackle the text in a more granular fashion. We'll use the RecursiveCharacterTextSplitter to divide our extensive documents into manageable chunks. This step is vital for enhancing the processing speed and effectiveness of our RAG system.

For this implementation, we will split the documents into chunks of 2000 characters each, with an overlap of 200 characters to maintain context across chunks. We also define a list of predefined separators to intelligently segment the text. It prioritises these separators, attempting to split at the first one before moving on to the others, ensuring natural breaks in the text.

Here’s how we set it up:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\\n\\n", "\\n", ".", " ", ""],

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

print(f'The number of chunked files is {len(docs_processed)}')

Embed and Vector Store

After chunking our documents, the next crucial step is embedding each chunk and storing these representations in a vector store. This process transforms each text chunk into a vector that captures its semantic essence using an embedding model. These vectors are then saved in a vector store, facilitating efficient retrieval based on query similarity.

For this project, we're utilising GPT4ALL as our embedding model to generate these semantic vectors. We will store these vectors using Chroma, a robust database designed for handling large-scale vector data efficiently.

Embeddings convert complex items like words or images into vectors, simplifying computer processing. In vector stores, these are organized such that similar items have similar vectors, enabling fast search for similar items in large data sets. For instance, if you search for a word, the vector store quickly finds semantically similar words by looking for close vectors.

Here’s how we initially set up our vector store and populate it with embedded documents:

from langchain.vectorstores import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings

# Define the directory and database name for Chroma
chroma_dir = './chromadb'
db_name = 'langchain_db'

# Create a vector store from the processed documents
vectorstore = Chroma.from_documents(documents=docs_processed,

2. Retreiver

With our vector store established, we're now ready to build and test our retriever. This component is crucial as it takes a user's query, embeds it using the same model that processed our chunks, and performs a similarity search. This search compares the query's embedding against all stored chunk embeddings, returning the most relevant top k chunks.

Since all documents are already embedded and stored, we avoid the overhead of re-embedding them for each query. Here’s how we efficiently load the database:

# Load the existing vector store
vectorstore = Chroma(collection_name=db_name,

Now, we will create a retriever from our vector store. For this example, let's configure it to return the top 2 documents related to the query:

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})

Finally, let’s put our system to the test with a simple query to see how our retriever performs in fetching relevant information:

vector_query = retriever.invoke('How can I keep chat history in my langchain code?')[0].page_content

In our case, we used the sentence "How can I keep chat history in my langchain code?" as our query. The retriever converts this sentence into a query using the GPT4ALL embedding model, and then retrieves the top 2 chunks, with their embedding vector most similar to the query. Let's print out the results to check them out 🧐:


3. Chat Model: Crafting Conversations with AI

Having previously set up our Llama 3 model, we are now ready to create the final component of our RAG system.

Setting Up the Chat Model

Here is all it takes to get our Llama3 language model up and running:

from langchain_community.chat_models import ChatOllama

model = ChatOllama(model='llama3', temperature=0)

This configuration initialises the model with a zero temperature, to make the responses more consistent and deterministic.

Crafting the Prompt Template

Next, we design a prompt template that our Chat Model will use. This template requires two key inputs:

  1. Question: The user's query, which defines the problem.

  2. Context: The relevant chunks provided by our retriever, offering the necessary background information.

When creating our prompt template we also need take into account specific syntax of our language model. Refer to Llama’s official documentation for detailed guidelines.

Here’s the prompt template we'll employ:

from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    You are a helpful AI assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say "I don't know".
    Question: {question}
    Context: {context}
    Answer: assistant
    input_variables=["question", "context"],

This template instructs Llama3 to act as a helpful assistant tasked with using the provided context to answer questions directly.

Creating the Chat Chain

Finally, we assemble our chat chain, which seamlessly integrates the prompt template with our model and parses the output to format it appropriately:

rag_chain = prompt | model | StrOutputParser()

This chain simplifies the flow from user input through AI processing to user-readable output, making our chat model not only functional but also efficient. Now, let's test and see our Chat Model in action, ready to tackle real-world questions with accuracy and relevance.

4. Grand Finale: Putting Our RAG System to the Test

We've built and configured every component of our RAG system—now it's time to see it in action! Let's conduct a test to demonstrate how effectively our system can handle real queries. Here is the Stack Overflow question we will use to test out our system: What does langchain CharacterTextSplitter's chunk_size param even do?

Testing Our RAG System

Here’s how we test our setup with a practical question:

pythonCopiar código
# Define the question
question = "What does langchain CharacterTextSplitter's chunk_size param even do?"

# Retrieve relevant documents using the retriever
docs = retriever.invoke(question)
content = "\\n\\n".join([doc.page_content for doc in docs])
print("Retrieved Context:")

# Execute the response generation chain
response = rag_chain.invoke({"question": question, "context": content})

# Display the generated response
print("AI's Response:")

RAG System Output

This was the response that our system was able to generate:

In the context of the provided code, the RecursiveCharacterTextSplitter is used with the language parameter specified as Python. This suggests that the splitter is being used to split Python code into chunks based on character-level splits, rather than relying solely on syntax-based splitting.

By setting the chunk_size parameter, you can adjust the size of these chunks to suit your specific use case. For example, if you want to split the text into smaller chunks for faster processing or to reduce memory usage, you could set a lower chunk_size. Conversely, if you need to preserve more context in each chunk, you might set a higher chunk_size.

Overall, the chunk_size parameter provides a way to fine-tune the text splitting process and adapt it to your specific requirements.

This response is incredibly clear, detailed, and even explains the code provided by the retriever. While our system is simple and has room for improvement, it demonstrates that setting up a basic RAG system is an attainable goal. It can also serve as a foundation for a more sophisticated, specialised system to address your specific needs.


Sign up to get updates when we release another amazing article

Thanks for subscribing!

bottom of page