rag step1

2025-02-05 23:14:33 -05:00 · 2025-02-05 23:14:33 -05:00 · f4cad805b7
parent 174a039322
commit f4cad805b7
4 changed files with 129 additions and 1 deletions
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,4 +3,6 @@ jupyterlab-lsp
 python-lsp-server[all]
 matplotlib
-pandas
+pandas
 GitPython
--- a/work/.gitignore
+++ b/work/.gitignore
@@ -0,0 +1,3 @@
 *
 !.gitignore
 !*.ipynb
--- a/work/.keep
+++ b/work/.keep
--- a/work/rag.ipynb
+++ b/work/rag.ipynb
@@ -0,0 +1,123 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook explores how a retrieval-augmented generation (RAG) pipeline could be built. The idea is to create a chatbot that answers questions using passages retrieved from a set of indexed documents.\n",
    "\n",
    "The requirements are as follows:\n",
    "* Source documents are in markdown format\n",
    "* Source documents are stored in a git repository\n",
    "* Everything needs to be self-hosted\n",
    "  * An Ollama server is already running locally (http://localhost:11434)\n",
    "* The interface is unimportant for now\n",
    "  * Eventually, we want it to be a bot hosted in Teams and/or Discord"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Steps\n",
    "\n",
    "### Step 1: Fetch the documents from the git repository\n",
    "\n",
    "We will use GitPython (imported as `git`) to clone the repository and fetch the documents.\n",
    "\n",
    "For this notebook, we will ingest the documentation of prometheus-operator located in this repository: https://github.com/prometheus-operator/prometheus-operator.git. The documentation is within the *Documentation/* folder of this repository.\n",
    "\n",
    "First things first: we need to clone the repository. If it already exists locally, we can just pull the latest changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from git import Repo\n",
    "\n",
    "repo_url = \"https://github.com/prometheus-operator/prometheus-operator.git\"\n",
    "local_repo_path = \"./prometheus-operator\"\n",
    "\n",
    "if not os.path.exists(local_repo_path):\n",
    "    Repo.clone_from(repo_url, local_repo_path)\n",
    "else:\n",
    "    repo = Repo(local_repo_path)\n",
    "    repo.remotes.origin.pull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a local copy of the repository, we can locate the Markdown documents in it and list them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 40 documents.\n"
     ]
    }
   ],
   "source": [
    "documentation_root = os.path.join(local_repo_path, \"Documentation\")\n",
    "documentation_files = []\n",
    "\n",
    "# Walk through the directory and find all markdown files\n",
    "for root, dirs, files in os.walk(documentation_root):\n",
    "    for file in files:\n",
    "        if file.endswith(\".md\"):\n",
    "            documentation_files.append(os.path.join(root, file))\n",
    "\n",
    "print(f\"Found {len(documentation_files)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Ingest the documents into a vector database\n",
    "\n",
    "To build our RAG, we need to store the documents in a vector database. Several options are available:\n",
    "* [FAISS](https://faiss.ai/)\n",
    "* [ChromaDB](https://www.trychroma.com/)\n",
    "* [Qdrant](https://qdrant.tech/)\n",
    "* etc.\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
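
The file-discovery cell in `work/rag.ipynb` could also be written with `pathlib`. The following is a minimal sketch and not part of this commit. The helper name `find_markdown_files` is hypothetical, and both the sorting and the explicit error for a missing folder are assumptions added here, so that the document order is reproducible across runs and a skipped clone step fails loudly instead of reporting zero documents.

```python
from pathlib import Path


def find_markdown_files(root):
    """Return all Markdown files below `root`, sorted for a stable order.

    Hypothetical helper: the notebook itself uses os.walk and does not sort.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # A missing folder usually means the clone step has not run yet.
        raise FileNotFoundError(f"Documentation folder not found: {root_path}")
    # rglob descends into subdirectories, like os.walk in the notebook cell.
    return sorted(p for p in root_path.rglob("*.md") if p.is_file())
```

In the notebook, this would be called as `find_markdown_files(os.path.join(local_repo_path, "Documentation"))`, and `len()` of the result would feed the same "Found N documents." message.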