RAGOps: Production-Grade RAG Platform

Personal Project

2024 – 2025
01

Problem

Naive RAG systems built on embedding-only retrieval suffer from poor recall on out-of-distribution queries, context pollution from irrelevant chunks, and zero visibility into why a given answer was produced. There was no systematic way to measure or improve retrieval quality over time.

02

Constraints

  • Heterogeneous document formats (PDF, markdown, HTML) required a unified ingestion pipeline
  • Query latency budget: end-to-end response under 3 seconds including reranking
  • Cost per query had to remain viable for self-hosted, single-tenant use
  • Evaluation required ground-truth labels — 150 QA pairs curated manually
  • No managed vector database; pgvector on PostgreSQL to keep the stack minimal (see the schema sketch after this list)
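
A minimal schema sketch for the pgvector choice, assuming a chunks table, a 768-dimensional embedding model, and psycopg2 as the driver; the names and sizes are illustrative, not the exact production schema:

```python
# Illustrative pgvector setup: table name, columns, and the 768-dim embedding size
# are assumptions; the dimension must match whatever embedding model is used.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(768)
);

-- Approximate nearest-neighbour index for cosine distance; IVFFlat keeps the stack simple.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"""

def init_schema(dsn: str) -> None:
    # Runs the DDL once at deploy time; the surrounding transaction commits on exit.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```
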
03

Approach

Replaced single-stage dense retrieval with a three-stage pipeline: (1) dual retrieval combining pgvector ANN search with BM25-style lexical matching, (2) score fusion to merge the two candidate lists, and (3) a cross-encoder reranker applied to the top-k candidates before the context is passed to the LLM. The chunking strategy was switched from fixed-size windows to semantic boundaries to improve chunk coherence, and a fallback gate rejects low-confidence queries rather than hallucinating an answer. Evaluation was embedded into the development loop: every pipeline change was measured against the 150-query benchmark before merging. A minimal sketch of the query path follows.
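
The sketch below uses reciprocal-rank fusion (RRF) for the score-fusion stage and a public MS MARCO cross-encoder as the reranker; both are illustrative choices, as are the candidate counts, function names, and the rejection threshold on the fallback gate.

```python
# Sketch of the three-stage query path: dense + lexical retrieval, RRF fusion,
# cross-encoder reranking, and a low-confidence rejection gate.
from sentence_transformers import CrossEncoder

RERANKER = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # loaded once at startup

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge candidate lists: each chunk scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str, dense_search, lexical_search, chunk_text: dict[str, str],
             top_k: int = 10) -> list[str]:
    dense_ids = dense_search(query, limit=50)      # stage 1a: pgvector ANN candidates
    lexical_ids = lexical_search(query, limit=50)  # stage 1b: BM25-style candidates
    shortlist = rrf_fuse([dense_ids, lexical_ids])[:top_k * 3]  # stage 2: fusion
    if not shortlist:
        return []

    # Stage 3: cross-encoder reranking of the fused shortlist.
    scores = RERANKER.predict([(query, chunk_text[cid]) for cid in shortlist])

    # Fallback gate: decline rather than pass weak context to the LLM.
    if max(scores) < 0.2:  # threshold is an assumption, tuned against the benchmark in practice
        return []

    ranked = [cid for _, cid in sorted(zip(scores, shortlist), reverse=True)]
    return ranked[:top_k]
```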

04

Architecture

System architecture (diagram): document ingestion through hybrid retrieval, reranking, and LLM generation, with observability tracing of each response.

Ingestion → Chunker → Embedder → pgvector + BM25 index → Fusion → Cross-encoder reranker → LLM generation → Traced response
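
One way the "Traced response" step can be realized is a per-query trace record capturing stage latencies and the retrieved chunk IDs; the field names and the JSON-line output below are assumptions for illustration, not the project's actual trace schema.

```python
# Sketch of a per-query trace: stage timings plus the chunk IDs that reached the LLM.
import json
import time
import uuid
from contextlib import contextmanager

@contextmanager
def stage(trace: dict, name: str):
    """Record wall-clock latency for one pipeline stage into the trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.setdefault("stages", {})[name] = round(time.perf_counter() - start, 4)

def answer_with_trace(query: str, retrieve, generate) -> dict:
    trace = {"trace_id": str(uuid.uuid4()), "query": query}
    with stage(trace, "retrieval"):
        chunk_ids = retrieve(query)
    trace["retrieved_chunk_ids"] = chunk_ids
    with stage(trace, "generation"):
        trace["answer"] = generate(query, chunk_ids)
    print(json.dumps(trace))  # shipped to the observability store in practice
    return trace
```
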
05

Metrics

Metric                       Baseline   Achieved
Recall@10                    ~58%       ~81%
Answer precision (manual)    62%        84%
Irrelevant context rate      31%        11%
Avg query latency            1.1 s      2.4 s (reranker added)
Benchmark queries            0          150 QA pairs

06

Product Impact

RAGOps functions as a self-hostable knowledge-base Q&A system for domain-specific document corpora. The observability dashboard lets an operator debug retrieval failures without re-running experiments manually. The evaluation framework enables confident iteration — any retrieval change is quantified before deployment, treating the LLM application as infrastructure rather than a prototype.
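
A sketch of what the benchmark loop behind the Recall@10 numbers might look like, assuming a JSONL file of QA pairs with labelled relevant chunk IDs; the file format, field names, and the regression gate are assumptions for illustration.

```python
# Benchmark loop sketch: for each curated QA pair, count a hit when at least one
# labelled-relevant chunk appears in the retriever's top-k results.
import json

def recall_at_k(benchmark_path: str, retrieve, k: int = 10) -> float:
    with open(benchmark_path) as f:
        examples = [json.loads(line) for line in f]  # {"question": ..., "relevant_chunk_ids": [...]}
    hits = 0
    for ex in examples:
        retrieved = set(retrieve(ex["question"])[:k])
        if retrieved & set(ex["relevant_chunk_ids"]):
            hits += 1
    return hits / len(examples)

# Example development-loop gate: reject a pipeline change that regresses recall.
# assert recall_at_k("benchmark.jsonl", retrieve) >= 0.81
```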

07

Tech Stack

Python
FastAPI
PostgreSQL
pgvector
Redis
Celery
Next.js
TypeScript
BM25
Cross-encoder reranker
LLM API
08

Links