This notebook explores how a retrieval-augmented generation (RAG) pipeline could be built. The idea is to create a chatbot that can answer questions based on the documents it has ingested.

Requirements are as follows:
* Source documents are in markdown format
* Source documents are stored in a git repository
* Everything needs to be self-hosted
  * An Ollama server is already running locally (https://localhost:11434)
* The interface is unimportant for now
  * Eventually, we want it to be a bot hosted in Teams and/or Discord

## Steps

### Step 1: Fetch the documents from the git repository

We will use `gitpython` to clone the repository and fetch the documents.

For this notebook, we will ingest the documentation of prometheus-operator located in this repository: https://github.com/prometheus-operator/prometheus-operator.git. The documentation is within the *Documentation/* folder of this repository.

First things first: we need to clone the repository. If it already exists locally, we can just pull the latest changes.

In [1]:
import os

from git import Repo

repo_url = "https://github.com/prometheus-operator/prometheus-operator.git"
local_repo_path = "./prometheus-operator"

if not os.path.exists(local_repo_path):
    Repo.clone_from(repo_url, local_repo_path)
else:
    repo = Repo(local_repo_path)
    repo.remotes.origin.pull()

Now that we have a local copy of the repository, we can find and list the documents we are interested in.

In [2]:
documentation_root = os.path.join(local_repo_path, "Documentation")
documentation_files = []

# Walk through the directory and find all markdown files
for root, dirs, files in os.walk(documentation_root):
    for file in files:
        if file.endswith(".md"):
            documentation_files.append(os.path.join(root, file))

print(f"Found {len(documentation_files)} documents.")

Found 40 documents.
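
The same search can also be written with `pathlib`. The sketch below is an alternative to the cell above, not what this notebook ran; the helper name `find_markdown_files` is hypothetical. It sorts the result so that the listing comes out in the same order on every run.

```python
from pathlib import Path


def find_markdown_files(root):
    """Return the paths of all .md files under root, sorted for a stable order."""
    # rglob("*.md") walks the directory tree recursively, much like os.walk.
    return sorted(str(path) for path in Path(root).rglob("*.md") if path.is_file())
```

Assuming the variables defined earlier, it would be called as `documentation_files = find_markdown_files(documentation_root)`.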


### Step 2: Ingest the documents into a vector database

To build our RAG, we need to store the documents in a vector database. Several options are available:
* [FAISS](https://faiss.ai/)
* [ChromaDB](https://www.trychroma.com/)
* [Qdrant](https://qdrant.tech/)
* etc.
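
Whichever store is chosen, the core operation is the same: each document is turned into a vector, and a query is answered by returning the documents whose vectors are closest to the query's vector. The sketch below illustrates that idea in plain Python. It is not the API of FAISS, ChromaDB, or Qdrant, and `toy_embed` is a hypothetical word-count stand-in for a real embedding model (which, in this setup, would presumably be served by Ollama). `ToyVectorStore` and `cosine_similarity` are likewise names made up for this illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory illustration of what a vector database does."""

    def __init__(self):
        self._entries = []  # list of (document id, vector) pairs

    def add(self, doc_id, text):
        """Embed the text and store it under the given document id."""
        self._entries.append((doc_id, toy_embed(text)))

    def search(self, query, k=3):
        """Return the k (doc_id, score) pairs most similar to the query."""
        query_vector = toy_embed(query)
        scored = [
            (doc_id, cosine_similarity(query_vector, vector))
            for doc_id, vector in self._entries
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Real vector databases follow the same add-then-search pattern, but they store dense vectors produced by an embedding model and typically use specialized indexes so that the search does not have to compare the query against every stored entry, as this sketch does.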


