# Sovereign Bible + Notes RAG System Specification

**Project Title**
Personal Bible Study Enhancement via Local Llama-Powered Retrieval-Augmented Generation (RAG)

**Author:** David DiPaola
**Version:** 1.2 – Updated February 23, 2026
**Status:** Active MVP (core components load successfully, GPU inference confirmed, ingestion & retrieval pipeline in progress)

## 1. Goal & Motivation

Develop a fully **local, sovereign** RAG system using **Llama-3B (quantized)** to query:
– World English Bible (WEB) in JSON format
– Personal study notes (PDF, PPTX, and DOCX formats)

The system provides **reasoned, context-aware responses** grounded in scripture references and personal insights, with step-by-step theological reasoning and accurate citations (Book, Chapter, Verse).

**Core Principles**
– **Data sovereignty** — no cloud APIs, no data leaves the device
– Runs efficiently on consumer hardware (single RTX 3070, 8 GB VRAM)
– AI-assisted and augmented development workflow to accelerate quality code generation
– Modular, testable, and extensible (future: fine-tuning, voice input, multi-modal notes)

**Success Metrics** (MVP targets)
– Inference latency: 5–8 seconds or less per query on the 8 GB GPU
– Retrieval precision@5: ≥ 90% relevant chunks
– Faithfulness score (Ragas): ≥ 95% (minimal hallucinations)
– Accurate verse/note citations in ≥ 90% of responses
– Code coverage (Pytest): > 80%
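
The precision@5 target is simple enough to check with a small helper during manual evaluation (a minimal sketch; the function name is hypothetical):

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks judged relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k

# 4 of the top 5 retrieved chunks are relevant -> 0.8
precision_at_k(["a", "b", "c", "d", "e"], {"a", "b", "c", "e"})
```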

## 2. Environment & Tooling

**Base Environment**
– OS: Windows 10
– Conda env: `llama_RAG4` (Python 3.12)
– IDE: Spyder (Anaconda)
– GPU: NVIDIA RTX 3070 (8 GB VRAM, CUDA 12.4 toolkit via PyTorch, driver supports up to 13.1)

**Key Libraries & Versions** (verified Feb 2026)
– torch: 2.5.1 (CUDA available: True, device: RTX 3070)
– CUDA: 12.4
– llama-cpp-python: 0.3.16 (GPU offload confirmed)
– sentence-transformers: 5.2.2
– transformers: 4.57.6
– faiss: 1.12.0 (CPU version – preserves VRAM for LLM)
– langchain: 1.2.10
– langchain-community: 0.4.1
– langchain-huggingface: 1.2.0
– datasets: 4.5.0
– numpy: 1.26.4 / pandas: 2.3.3

**Setup Notes**
– Separate envs considered for fine-tuning vs. inference (to avoid conflicts)
– FAISS on CPU (ms latency difference negligible for personal use)
– Llama models: Quantized GGUF (start with TinyLlama-1.1B Q4_K_M for testing, target Llama-3B/8B for Bible RAG)
– Git: Version control with feature branches (e.g., `feat/ingestion`, `feat/rag-chain`)

## 3. Architecture Overview

See flowchart on website (https://www.dceams.com/wp-content/uploads/2026/02/Workflow-FlowChart-scaled.jpg).

**Ingestion (Offline)**
– Sources → Loaders (JSON for Bible, PyPDF/PPTX/Docx for notes)
– Chunking → RecursiveCharacterTextSplitter (chunk_size=500–800, overlap=50–100, respect verse boundaries)
– Embeddings → sentence-transformers/all-mpnet-base-v2
– Storage → FAISS index (local, CPU)
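
The verse-boundary constraint rules out naive character splitting. The actual pipeline uses RecursiveCharacterTextSplitter with separators tuned to verse breaks; a dependency-free sketch of the same idea — greedily packing whole verses into chunks, never splitting one — looks like this (names hypothetical):

```python
def chunk_verses(verses, max_chars=600):
    """Greedily pack whole verses into chunks up to max_chars.

    `verses` is a list of (reference, text) tuples, e.g. ("John 3:16", "...").
    A verse is never split across chunks, so citations stay intact.
    """
    chunks, current, size = [], [], 0
    for ref, text in verses:
        entry = f"{ref} {text}"
        if current and size + len(entry) > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(entry)
        size += len(entry) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks
```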

**Query-Time RAG**
– User Prompt → Bi-Encoder semantic search (dense retrieval)
– Top-k → Cross-Encoder reranker (precision boost)
– Retrieved chunks + metadata → PromptTemplate (reasoning chain)
– Prompt + context → LangChain orchestration (RetrievalQA or create_retrieval_chain)
– Generation → LlamaCpp (llama-cpp-python backend, quantized Llama-3B, GPU offload)
– Output → Reasoned response with citations

## 4. Modular Design & Microstages

1. **Data Ingestion & Parsing**
– Bible: JSON → structured chunks with metadata (Book, Chapter, Verse)
– Notes: Multi-format loaders → extract text + preserve structure
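
The WEB JSON layout varies by distribution; assuming a nested book → chapter → verse mapping, Bible ingestion reduces to a flatten pass that attaches citation metadata to every verse (a sketch — adapt the traversal to the actual schema of the downloaded file):

```python
def bible_json_to_records(raw_json):
    """Flatten a Bible JSON dump into per-verse records with citation metadata.

    Assumes a {book: {chapter: {verse: text}}} layout (hypothetical; check
    the real WEB file). The metadata dict later rides along into FAISS so
    retrieved chunks can be cited as Book Chapter:Verse.
    """
    records = []
    for book, chapters in raw_json.items():
        for chapter, verses in chapters.items():
            for verse, text in verses.items():
                records.append({
                    "text": text,
                    "metadata": {"book": book, "chapter": int(chapter), "verse": int(verse)},
                })
    return records
```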

2. **Embedding & Indexing**
– HuggingFaceEmbeddings wrapper
– FAISS.from_documents() with metadata
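
Conceptually, a flat FAISS index performs exact nearest-neighbor search over the stored embedding vectors; a dependency-free illustration of that core similarity operation (toy 2-D vectors stand in for 768-dim mpnet embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """index: list of (chunk_id, embedding). Returns chunk ids by similarity."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```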

3. **Retrieval Pipeline**
– Bi-Encoder semantic search via `vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})`, with k tuned between 5 and 8 (dense vector similarity using pre-computed embeddings from all-mpnet-base-v2)
– Cross-Encoder reranking
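
Unlike the bi-encoder, a cross-encoder scores each (query, chunk) pair jointly rather than comparing pre-computed vectors; the reranking step itself is just a sort over those scores. A sketch, with a toy word-overlap scorer standing in for a real cross-encoder model:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order bi-encoder candidates by a pairwise relevance score.

    `score_fn(query, chunk) -> float` stands in for a cross-encoder's
    predict() call; higher means more relevant.
    """
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def word_overlap(query, chunk):
    """Toy scorer: count of shared words (illustration only)."""
    return len(set(query.split()) & set(chunk.split()))
```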

4. **Prompt Engineering**
– ChatPromptTemplate or PromptTemplate
– System prompt: “You are a wise Bible scholar. Use provided context only. Reason step-by-step with citations.” (wording under active refinement)
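
Whether ChatPromptTemplate or PromptTemplate is used, prompt assembly reduces to filling context and question slots; a plain-string sketch (template wording illustrative, not final):

```python
PROMPT_TEMPLATE = """You are a wise Bible scholar. Use the provided context only.
Reason step-by-step and cite every claim as (Book Chapter:Verse).

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(chunks, question):
    """Join retrieved chunks (with their citation metadata) into the prompt."""
    context = "\n\n".join(f"[{c['metadata']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```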

5. **Generation Chain**
– RetrievalQA
– Or modern: create_retrieval_chain + Runnable
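
Either API reduces to the same retrieve → prompt → generate pipeline; a framework-free sketch with callables standing in for the LangChain retriever, template, and LlamaCpp LLM:

```python
def rag_chain(question, retriever, build_prompt, llm):
    """Minimal retrieve -> prompt -> generate pipeline.

    `retriever`, `build_prompt`, and `llm` are callables standing in for the
    LangChain retriever, PromptTemplate, and LlamaCpp components.
    Returning the context alongside the answer makes Ragas evaluation
    and citation checks straightforward.
    """
    chunks = retriever(question)
    prompt = build_prompt(chunks, question)
    return {"question": question, "context": chunks, "answer": llm(prompt)}
```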

6. **Output Parsing**
– Extract citations, reasoning steps, final answer
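
Citation extraction can be a regex pass over the generated answer. The pattern below is a sketch: it handles forms like “John 3:16” and “1 Corinthians 13:4” but not verse ranges or abbreviated book names:

```python
import re

# Matches citations like "John 3:16" or "1 Corinthians 13:4".
CITATION_RE = re.compile(r"\b(\d?\s?[A-Z][a-z]+)\s+(\d+):(\d+)\b")

def extract_citations(answer):
    """Return (book, chapter, verse) tuples found in a generated answer."""
    return [(book.strip(), int(ch), int(vs)) for book, ch, vs in CITATION_RE.findall(answer)]
```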

## 5. Testing & Validation

**Unit/Integration**
– Pytest: Modular tests for loaders, chunking, embedding, retrieval, generation
– Target: >80% coverage
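
Example of the intended Pytest style, with the project's verse-aware chunker stubbed inline to keep the test self-contained (all names hypothetical):

```python
# test_chunking.py -- example Pytest module. `chunk_verses` is a stand-in
# for the project's verse-aware chunker, stubbed inline for illustration.
def chunk_verses(verses, max_chars=600):
    chunks, current = [], ""
    for ref, text in verses:
        entry = f"{ref} {text}"
        if current and len(current) + len(entry) > max_chars:
            chunks.append(current)
            current = ""
        current = (current + " " + entry).strip()
    return chunks + [current] if current else chunks

def test_no_verse_is_split():
    # Every verse reference must appear intact in some chunk.
    verses = [("Gen 1:1", "In the beginning"), ("Gen 1:2", "And the earth")]
    chunks = chunk_verses(verses, max_chars=40)
    assert all(any(ref in chunk for chunk in chunks) for ref, _ in verses)

def test_chunks_respect_size_limit():
    verses = [("Ps 23:1", "x" * 30), ("Ps 23:2", "y" * 30)]
    assert all(len(c) <= 40 for c in chunk_verses(verses, max_chars=40))
```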

**RAG-Specific Evaluation**
– Ragas: RAG Triad (Context Relevance, Faithfulness, Answer Correctness)
– Manual review: 25–50 sample queries (theological accuracy, hallucination check)

**Performance**
– Latency profiling (cProfile or simple timing)
– VRAM monitoring (torch.cuda.memory_summary())
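
A minimal timing harness for latency profiling, with an optional VRAM report guarded so it degrades gracefully off-GPU (the lambda below stands in for the real RAG chain):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) for latency profiling."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def report_vram():
    """Print GPU memory stats when torch + CUDA are available."""
    try:
        import torch
        if torch.cuda.is_available():
            print(torch.cuda.memory_summary())
    except ImportError:
        pass

answer, seconds = timed(lambda q: q.upper(), "test query")  # stand-in for the chain
```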

## 6. Optimization & Constraints

– Quantization: 4-bit or 5-bit GGUF
– GPU offload: n_gpu_layers=-1 (offload all layers when VRAM allows)
– Context: n_ctx=4096–8192 (balance VRAM vs. coherence)
– Batch: n_batch=512+
– Memory tricks: Gradient checkpointing, FP16 if needed
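
The constraints above map onto llama-cpp-python constructor arguments roughly as follows (model path hypothetical; tune n_ctx down first if VRAM runs out):

```python
# Suggested llama-cpp-python settings for an 8 GB RTX 3070,
# taken from the constraints above.
LLM_KWARGS = {
    "model_path": "models/llama-3b-q4_k_m.gguf",  # hypothetical local path
    "n_gpu_layers": -1,  # offload every layer to the GPU
    "n_ctx": 4096,       # context window; try 8192 if VRAM allows
    "n_batch": 512,      # prompt-processing batch size
    "verbose": False,
}
# Usage: llm = llama_cpp.Llama(**LLM_KWARGS)
```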

## 7. Development Workflow (AI-Augmented)

Follow site workflow:
1. Research (tutorials, docs)
2. Goal definition
3. Markdown spec (this document – living)
4. Modular builds + AI code gen
5. Test early/often (Pytest + Ragas)
6. Refactor + document
7. Retrospective & iterate

## 8. Risks & Mitigations

– Dependency conflicts → Use cloned envs for testing
– Hallucinations → Strict prompts + Ragas faithfulness monitoring
– VRAM limits (8 GB) → Quantized Llama-3B fits with headroom
– Model coherence → Upgrade to Llama-8B after MVP

## 9. Future Extensions

– Fine-tune Llama on theology corpus
– Add conversational memory
– Multi-modal notes (images, audio)
– Enterprise analog: Sovereign RAG for internal research synthesis

**Last Updated:** February 23, 2026
**Next Milestone:** Full ingestion → indexing → end-to-end query test with Llama-3B