{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explores how a retrieval-augmented generation (RAG) pipeline could be built. The idea is to create a chatbot that can answer questions based on a set of source documents.\n", "\n", "Requirements are as follows:\n", "* Source documents are in markdown format\n", "* Source documents are stored in a git repository\n", "* Everything needs to be self-hosted\n", "    * An Ollama server is already running locally (https://localhost:11434)\n", "* The interface is unimportant for now\n", "    * Eventually, we want it to be a bot hosted in Teams and/or Discord" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Steps\n", "\n", "### Step 1: Fetch the documents from the git repository\n", "\n", "We will use `gitpython` to clone the repository and fetch the documents.\n", "\n", "For this notebook, we will ingest the documentation of prometheus-operator, located in this repository: https://github.com/prometheus-operator/prometheus-operator.git. The documentation lives in the *Documentation/* folder of this repository.\n", "\n", "First things first, we need to clone the repository. If it already exists locally, we can just pull the latest changes." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from git import Repo\n", "\n", "repo_url = \"https://github.com/prometheus-operator/prometheus-operator.git\"\n", "local_repo_path = \"./prometheus-operator\"\n", "\n", "# Clone on the first run; otherwise pull the latest changes\n", "if not os.path.exists(local_repo_path):\n", "    Repo.clone_from(repo_url, local_repo_path)\n", "else:\n", "    repo = Repo(local_repo_path)\n", "    repo.remotes.origin.pull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a local copy of the repository, we can find and list the documents we are interested in."
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 40 documents.\n" ] } ], "source": [ "documentation_root = os.path.join(local_repo_path, \"Documentation\")\n", "documentation_files = []\n", "\n", "# Walk through the directory and find all markdown files\n", "for root, dirs, files in os.walk(documentation_root):\n", "    for file in files:\n", "        if file.endswith(\".md\"):\n", "            documentation_files.append(os.path.join(root, file))\n", "\n", "print(f\"Found {len(documentation_files)} documents.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Ingest the documents into a vector database\n", "\n", "To build our RAG, we need to store the documents in a vector database. Several options are available:\n", "* [FAISS](https://faiss.ai/)\n", "* [ChromaDB](https://www.trychroma.com/)\n", "* [Qdrant](https://qdrant.tech/)\n", "* etc.\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 2 }