rag step1

Massaki Archambault 2025-02-05 23:14:33 -05:00
parent 174a039322
commit f4cad805b7
4 changed files with 129 additions and 1 deletions


@ -3,4 +3,6 @@ jupyterlab-lsp
python-lsp-server[all]
matplotlib
pandas
GitPython

3
work/.gitignore vendored Normal file

@ -0,0 +1,3 @@
*
!.gitignore
!*.ipynb


123
work/rag.ipynb Normal file

@ -0,0 +1,123 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook explores how a retrieval-augmented generation (RAG) system could be built. The idea is to create a chatbot that answers questions by retrieving relevant passages from a set of indexed documents, rather than relying only on what the model learned during training.\n",
"\n",
"The requirements are as follows:\n",
"* Source documents are in Markdown format\n",
"* Source documents are stored in a Git repository\n",
"* Everything needs to be self-hosted\n",
"  * An Ollama server is already running locally (http://localhost:11434)\n",
"* The interface is unimportant for now\n",
"  * Eventually, we want it to be a bot hosted in Teams and/or Discord"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Steps\n",
"\n",
"### Step 1: Fetch the documents from the git repository\n",
"\n",
"We will use GitPython (imported as `git`) to clone the repository and fetch the documents.\n",
"\n",
"For this notebook, we will ingest the documentation of prometheus-operator, hosted at https://github.com/prometheus-operator/prometheus-operator.git. The documentation lives in the *Documentation/* folder of that repository.\n",
"\n",
"First things first: we need to clone the repository. If it already exists locally, we can just pull the latest changes."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from git import Repo\n",
"\n",
"repo_url = \"https://github.com/prometheus-operator/prometheus-operator.git\"\n",
"local_repo_path = \"./prometheus-operator\"\n",
"\n",
"# Clone on the first run; on later runs, pull the latest changes instead.\n",
"if not os.path.exists(local_repo_path):\n",
"    Repo.clone_from(repo_url, local_repo_path)\n",
"else:\n",
"    repo = Repo(local_repo_path)\n",
"    repo.remotes.origin.pull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a local copy of the repository, we can locate the Markdown documents inside it and list them."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 40 documents.\n"
]
}
],
"source": [
"documentation_root = os.path.join(local_repo_path, \"Documentation\")\n",
"documentation_files = []\n",
"\n",
"# Walk through the directory and find all markdown files\n",
"for root, dirs, files in os.walk(documentation_root):\n",
"    for file in files:\n",
"        if file.endswith(\".md\"):\n",
"            documentation_files.append(os.path.join(root, file))\n",
"\n",
"print(f\"Found {len(documentation_files)} documents.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Ingest the documents in a vector database\n",
"\n",
"To build our RAG system, we need to store embeddings of the documents in a vector database, so that the passages most relevant to a question can be retrieved by similarity search. Several options are available:\n",
"* [FAISS](https://faiss.ai/)\n",
"* [ChromaDB](https://www.trychroma.com/)\n",
"* [Qdrant](https://qdrant.tech/)\n",
"* etc.\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
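
Note on the file-discovery cell in Step 1 of the notebook: the same walk could also be sketched with `pathlib`, which returns the files in a stable, sorted order. This is only a minimal sketch and not the notebook's code. The function name `find_markdown_files` is hypothetical, and the example assumes the repository has already been cloned to `./prometheus-operator`, as in the notebook.

```python
from pathlib import Path


def find_markdown_files(root):
    """Return every *.md file under root, sorted for a stable order.

    Hypothetical helper, not part of the notebook.
    """
    root = Path(root)
    if not root.is_dir():
        # Nothing to ingest if the folder is missing (e.g. not cloned yet).
        return []
    # rglob descends into subdirectories, like os.walk in the notebook.
    return sorted(p for p in root.rglob("*.md") if p.is_file())


if __name__ == "__main__":
    # Assumed location, matching the notebook's local_repo_path.
    docs = find_markdown_files(Path("./prometheus-operator") / "Documentation")
    print(f"Found {len(docs)} documents.")
```

Returning an empty list for a missing folder keeps the cell safe to re-run before the clone has finished; raising an error instead would be an equally reasonable choice.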