This notebook explores how a RAG (retrieval-augmented generation) system could be built. The idea is to create a chatbot that can answer questions based on the documents it has been given access to.

Requirements are as follows:
* Source documents are in markdown format
* Source documents are stored in a git repository
* Everything needs to be self-hosted
  * An Ollama server is already running locally (http://localhost:11434)
* The interface is unimportant for now
  * Eventually, we want it to be a bot hosted in Teams and/or Discord

For this notebook, we will ingest the documentation of Bitburner.

>Bitburner is a programming-based incremental game that revolves around hacking and cyberpunk themes.

The documentation lives in the *src/Documentation/* folder of this repository: https://github.com/bitburner-official/bitburner-src.git

## Steps

### Step 1: Fetch the documents from the git repository

We will use `GitPython` to clone the repository and fetch the documents.


In [1]:
%pip install --quiet --upgrade langchain-community GitPython


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


First things first, we need to clone the repository. If it already exists locally, we can just pull the latest changes.

In [2]:
import os

from git import Repo

repo_url = "https://github.com/bitburner-official/bitburner-src.git"
local_repo_path = "./bitburner"

# Clone on the first run; on later runs, reuse the local copy and pull updates.
if not os.path.exists(local_repo_path):
    Repo.clone_from(repo_url, local_repo_path)
else:
    repo = Repo(local_repo_path)
    repo.remotes.origin.pull()

Now that we have a local copy of the repository, we can find the documents we are interested in and list them.

In [3]:
doc_root = os.path.join(local_repo_path, "src/Documentation")
doc_files = []

# Walk through the directory and find all markdown files
for root, dirs, files in os.walk(doc_root):
    for file in files:
        if file.endswith(".md"):
            doc_files.append(os.path.join(root, file))

print(f"Found {len(doc_files)} documents.")

Found 63 documents.
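The same listing could also be written with `pathlib`. The sketch below is only an alternative under stated assumptions: `find_markdown_files` is a hypothetical helper name, and the path assumes the repository was cloned to `./bitburner` as above. Sorting the result gives a stable order between runs.

```python
from pathlib import Path


def find_markdown_files(root):
    """Return every *.md file under `root` as a string path, sorted for a stable order."""
    return sorted(str(path) for path in Path(root).rglob("*.md") if path.is_file())


# Hypothetical usage, assuming the repository was cloned to ./bitburner.
doc_files = find_markdown_files("./bitburner/src/Documentation")
print(f"Found {len(doc_files)} documents.")
```

`rglob` walks subdirectories recursively, so no explicit nested loop is needed.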


### Step 2: Index the documents in a vector database

To build our RAG, we need to store the documents in a vector database. Several options are available:
* [FAISS](https://faiss.ai/)
* [ChromaDB](https://www.trychroma.com/)
* [Qdrant](https://qdrant.tech/)
* etc.

For this example, we will use the vector store that comes with [LangChain](https://langchain.com/) rather than a dedicated database. LangChain is a convenient all-in-one framework that is commonly used in LLM applications. As for our backend, we'll use Ollama because it lets us run the models locally.

In [4]:
%pip install --quiet --upgrade langchain-community langchain-ollama langgraph


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


To index our documents:
1. We need to convert our documents into vectors. For this, we can use an embedding model. *nomic-embed-text* should provide reasonable performance for our purpose.

In [5]:
from langchain_ollama.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model='nomic-embed-text')

2. Once we have our vectors, we can store them in a database for future retrieval. LangChain conveniently provides an `InMemoryVectorStore` which will do the job.

In [6]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

3. Now that we have our embedding model and our vector database, we can start indexing our documents. LangChain has 100+ `DocumentLoader`s to aid us with this task. The documentation is written in markdown, so we can use [UnstructuredMarkdownLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html#langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader) to load them.

In [7]:
%pip install --quiet --upgrade unstructured markdown


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [8]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader


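# Each call to `load()` returns a list of `Document` objects,
# so `documents` ends up as a list of lists (one inner list per file).
documents = []

for file in doc_files:
    loader = UnstructuredMarkdownLoader(
        file,
        mode='single',  # keep each file as one document instead of splitting it into elements
        strategy='fast',
    )
    documents.append(loader.load())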

print(f'Loaded {len(documents)} documents')

Loaded 63 documents
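Note that `loader.load()` returns a list, so `documents` above is a list of lists with one inner list per file. If a flat list is more convenient, it can be flattened. Here is a minimal sketch, using plain strings as stand-ins for `Document` objects. The sample values and variable names are made up for illustration.

```python
from itertools import chain

# Stand-in for the nested structure built above: one inner list per file.
# Plain strings replace LangChain `Document` objects to keep the sketch self-contained.
nested_documents = [
    ["hacking.md content"],
    ["scripts.md content"],
    ["servers.md content"],
]

# chain.from_iterable concatenates the inner lists into a single sequence.
flat_documents = list(chain.from_iterable(nested_documents))

print(f"{len(nested_documents)} files -> {len(flat_documents)} documents")
```

With a flat list of real `Document` objects, a single `add_documents` call could index everything at once instead of looping over the inner lists.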


4. We can now store our documents in the database.

In [9]:
# Each entry in `documents` is the list returned by one `loader.load()` call.
for document in documents:
    vector_store.add_documents(documents=document)

5. Finally, we can retrieve documents from our database.

In [10]:
vector_store.search(
    query='How to hack',
    search_type='similarity',
)

[Document(id='93e7fa11-5553-40c0-aef7-7e94299bdd72', metadata={'source': './bitburner/src/Documentation/doc/basic/hacking.md'}, page_content='Hacking\n\nIn the year 2077, currency has become digital and decentralized. People and corporations store their money on servers. By hacking these servers, you can steal their money and gain experience.\n\nGaining Root Access\n\nThe first step to hacking a server is to gain root access to that server. This can be done using the NUKE.exe virus. You start the game with a copy of the NUKE.exe virus on your home computer. The NUKE.exe virus attacks the target server\'s open ports using buffer overflow exploits. When successful, you are granted root administrative access to the machine.\n\nIn order for the NUKE.exe virus to succeed, the target server needs to have enough open ports. Some servers have no security and will not need any ports opened. Some will have very high security and will need many ports opened. In order to open ports on another serv
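
To give an intuition for what a similarity search does, here is a minimal sketch with hand-made three-dimensional vectors. It assumes cosine similarity as the metric. The vectors, the file names other than `hacking.md`, and the helpers `cosine_similarity` and `top_k` are all invented for illustration. Real embeddings come from the embedding model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; real ones come from the embedding model.
toy_index = {
    "hacking.md": [0.9, 0.1, 0.0],
    "stockmarket.md": [0.0, 0.2, 0.9],
    "scripts.md": [0.6, 0.7, 0.1],
}
toy_query = [1.0, 0.2, 0.0]  # pretend embedding of "How to hack"

for name, score in top_k(toy_query, toy_index):
    print(f"{name}: {score:.3f}")
```

The document whose vector points in nearly the same direction as the query vector ranks first, which is the behaviour the search above relies on.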

### Step 3: Build the RAG chain

There are a number of frameworks available to build our RAG.
* [LangChain](https://langchain.com/)
* [LlamaIndex](https://docs.llamaindex.ai/en/latest/)

In this example we will use LangChain.