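This notebook explores how a retrieval-augmented generation (RAG) system could be built. The idea is to create a chatbot that can answer questions based on the documents it has been given, retrieving the relevant ones at query time rather than being trained on them.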

Requirements are as follows:
* Source documents are in markdown format
* Source documents are stored in a git repository
* Everything needs to be self-hosted
  * An Ollama server is already running locally (http://localhost:11434)
* The interface is unimportant for now
  * Eventually, we want it to be a bot hosted in Teams and/or Discord

## Steps

### Step 1: Fetch the documents from the git repository

We will use `gitpython` to clone the repository and fetch the documents.

For this notebook, we will ingest the documentation of prometheus-operator located in this repository: https://github.com/prometheus-operator/prometheus-operator.git. The documentation is within the *Documentation/* folder of this repository.

First things first: we need to clone the repository. If it already exists locally, we can just pull the latest changes.

In [1]:
import os

from git import Repo

repo_url = "https://github.com/prometheus-operator/prometheus-operator.git"
local_repo_path = "./prometheus-operator"

if not os.path.exists(local_repo_path):
    Repo.clone_from(repo_url, local_repo_path)
else:
    repo = Repo(local_repo_path)
    repo.remotes.origin.pull()

Now that we have a local copy of the repository, we can find and list the documents we are interested in.

In [2]:
documentation_root = os.path.join(local_repo_path, "Documentation")
documentation_files = []

# Walk through the directory and find all markdown files
for root, dirs, files in os.walk(documentation_root):
    for file in files:
        if file.endswith(".md"):
            documentation_files.append(os.path.join(root, file))

print(f"Found {len(documentation_files)} documents.")

Found 40 documents.
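
As a possible alternative to the `os.walk` loop above, the same search can be sketched with `pathlib`. The helper name `find_markdown_files` is an assumption made for this illustration, not something the notebook defines. Sorting is added because `os.walk` does not guarantee an order, and a stable order makes later runs easier to compare.

```python
from pathlib import Path


def find_markdown_files(root: str) -> list[str]:
    """Return every .md file under `root`, sorted for a stable order."""
    # rglob searches recursively; is_file() skips directories that happen to end in ".md".
    return sorted(str(path) for path in Path(root).rglob("*.md") if path.is_file())
```

It could be called as `documentation_files = find_markdown_files(documentation_root)`. Note that `pathlib` normalizes paths, so a leading `./` is dropped from the returned strings.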


### Step 2: Ingest the documents into a vector database

To build our RAG, we need to store the documents in a vector database. Several options are available:
* [FAISS](https://faiss.ai/)
* [ChromaDB](https://www.trychroma.com/)
* [Qdrant](https://qdrant.tech/)
* etc.
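
To make the idea of a vector database more concrete, here is a toy sketch of what such a store does conceptually: each document is represented by a vector, and a query returns the documents whose vectors are closest to the query vector. The vectors and labels below are hand-made stand-ins for real embeddings, and cosine similarity is only one common choice of measure; the databases listed above may use a different metric and far more efficient index structures.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine_similarity(query, index[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional vectors standing in for real document embeddings.
toy_index = {
    "alerting": [0.9, 0.1, 0.0],
    "storage": [0.1, 0.9, 0.1],
    "rbac": [0.0, 0.2, 0.9],
}

print(top_k([0.8, 0.2, 0.1], toy_index, k=2))  # → ['alerting', 'storage']
```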

For this example, we will use ChromaDB because it is easy to set up and use. Helpfully, ChromaDB can generate embeddings for us automatically. We will store the documents in a collection called `documentation`. The collection will live in memory, but in a more complete setup, we could run Chroma in client-server mode and/or with persistence enabled.

In [3]:
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="documentation")

# Read the contents of each document and store them in a list
documents = []
for file in documentation_files:
    with open(file, "r", encoding="utf-8") as f:
        content = f.read()
        documents.append(content)

# Add the documents to the collection
collection.add(documents=documents, ids=documentation_files)
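
The ids used above are paths relative to the notebook's working directory, so they change if the clone location changes. One possible variation, sketched below, derives the ids from the path relative to the documentation root and records the same value as metadata. The helper name `build_records` and the `source` metadata key are assumptions made for this illustration.

```python
import os


def build_records(files: list[str], root: str) -> tuple[list[str], list[str], list[dict]]:
    """Read each file and return parallel lists of ids, texts, and metadata."""
    ids, texts, metadatas = [], [], []
    for file in files:
        with open(file, "r", encoding="utf-8") as f:
            texts.append(f.read())
        # Use the path relative to `root` so ids do not depend on the clone location.
        relative_path = os.path.relpath(file, root)
        ids.append(relative_path)
        metadatas.append({"source": relative_path})
    return ids, texts, metadatas


# Possible use in place of the `collection.add` call above:
# ids, texts, metadatas = build_records(documentation_files, documentation_root)
# collection.add(documents=texts, ids=ids, metadatas=metadatas)
```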

Now that we have built our collection, we can try to query it. Let's search for a document about Prometheus.

In [8]:
import json # we will use this to pretty-print the result

result = collection.query(
    query_texts=["This is a document about prometheus."],
    n_results=3, # how many results to return (10 by default)
)
print(json.dumps(result, indent=2))

{
  "ids": [
    [
      "./prometheus-operator/Documentation/platform/prometheus-agent.md",
      "./prometheus-operator/Documentation/proposals/202201-prometheus-agent.md",
      "./prometheus-operator/Documentation/additional-scrape-config.md"
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "---\nweight: 204\ntoc: true\ntitle: Prometheus Agent\nmenu:\n    docs:\n        parent: user-guides\nlead: \"\"\nimages: []\ndraft: false\ndescription: Guide for running Prometheus in Agent mode\n---\n\n{{< alert icon=\"\ud83d\udc49\" text=\"Prometheus Operator >= v0.64.0 is required.\"/>}}\n\nAs mentioned in [Prometheus's blog](https://prometheus.io/blog/2021/11/16/agent/), Prometheus Agent\nis a deployment model optimized for environments where all collected data is forwarded to\na long-term storage solution, e.g. Cortex, Thanos or Prometheus, that do not need storage or rule evaluation.\n\nFirst of all, make sure that the PrometheusAgent CRD is installed in the cluster and that 