generated from badjware/jupiter-notebook-template
124 lines
3.5 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook explores how a retrieval-augmented generation (RAG) pipeline could be built. The goal is a chatbot that answers questions based on the documents it has ingested.\n",
    "\n",
    "Requirements are as follows:\n",
    "* Source documents are in Markdown format\n",
    "* Source documents are stored in a git repository\n",
    "* Everything needs to be self-hosted\n",
    "  * An Ollama server is already running locally (https://localhost:11434)\n",
    "* The interface is unimportant for now\n",
    "  * Eventually, we want it to be a bot hosted in Teams and/or Discord"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Steps\n",
    "\n",
    "### Step 1: Fetch the documents from the git repository\n",
    "\n",
    "We will use `gitpython` to clone the repository and fetch the documents.\n",
    "\n",
    "For this notebook, we will ingest the prometheus-operator documentation from this repository: https://github.com/prometheus-operator/prometheus-operator.git. The documentation lives in the *Documentation/* folder of the repository.\n",
    "\n",
    "First things first: we need to clone the repository. If it already exists locally, we can just pull the latest changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from git import Repo\n",
    "\n",
    "repo_url = \"https://github.com/prometheus-operator/prometheus-operator.git\"\n",
    "local_repo_path = \"./prometheus-operator\"\n",
    "\n",
    "# Clone on the first run; on later runs, pull the latest changes instead\n",
    "if not os.path.exists(local_repo_path):\n",
    "    Repo.clone_from(repo_url, local_repo_path)\n",
    "else:\n",
    "    repo = Repo(local_repo_path)\n",
    "    repo.remotes.origin.pull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a local copy of the repository, we can walk the *Documentation/* folder and collect the paths of all the Markdown files in it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 40 documents.\n"
     ]
    }
   ],
   "source": [
    "documentation_root = os.path.join(local_repo_path, \"Documentation\")\n",
    "documentation_files = []\n",
    "\n",
    "# Walk through the directory and find all markdown files\n",
    "for root, dirs, files in os.walk(documentation_root):\n",
    "    for file in files:\n",
    "        if file.endswith(\".md\"):\n",
    "            documentation_files.append(os.path.join(root, file))\n",
    "\n",
    "print(f\"Found {len(documentation_files)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Ingest the documents into a vector database\n",
    "\n",
    "To build our RAG, we need to store the documents in a vector database. Several options are available:\n",
    "* [FAISS](https://faiss.ai/)\n",
    "* [ChromaDB](https://www.trychroma.com/)\n",
    "* [Qdrant](https://qdrant.tech/)\n",
    "* etc."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}