A few months ago, I built my first RAG system using Palantir’s platform (see my journey there: AI Baby Sleep Assistant using RAG and Palantir AIP).

The system worked well, but it felt like a black box. I wanted to know what it would take to build something similar myself. After over a decade of building data products and infrastructure, a surface-level understanding wasn’t enough. This post isn’t a step-by-step tutorial. It’s the story of my journey learning how RAG systems really work, told in a way that’s accessible whether you are a developer or simply curious.

My RAG system — deployed locally

Foundation and Discovery

At first, RAG systems seemed straightforward: ingest documents, split them, embed them, run search queries. But as I dug deeper and worked through materials like the “Advanced LLMs with Retrieval Augmented Generation: Practical Projects for AI Applications” course, I came to realize how much complexity hides behind each step.

Document ingestion and parsing — modern PDF parsers don’t just extract text; they also preserve structure and metadata such as reading order (page numbers), tables, formulas, code and even images. This can be carried along as crucial additional context in the form of metadata.
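To make this concrete, here is a minimal ingestion sketch using PyMuPDF (my illustrative choice; the post doesn’t name the parser actually used). The point is that each piece of text travels together with metadata like the source file and page number:

```
# Hypothetical ingestion sketch using PyMuPDF (fitz); the parser choice is illustrative.
import fitz  # PyMuPDF

def parse_pdf(path: str) -> list[dict]:
    doc = fitz.open(path)
    records = []
    for page_number, page in enumerate(doc, start=1):
        text = page.get_text("text")  # plain text in reading order for this page
        if text.strip():
            records.append({
                "content": text,
                "metadata": {
                    "source": path,
                    "page": page_number,               # preserved for later citation
                    "title": doc.metadata.get("title"),
                },
            })
    return records
```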

Chunking strategies — this isn’t just about splitting text; it’s about splitting it in a way that retains context. Two forces pull against each other: chunks so broad that they capture irrelevant information or exceed the model’s context window, and chunks so small that the response suffers from a lack of context.
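A toy chunker makes the tension visible. The chunk size and overlap below are illustrative, not the values used in the actual system; the size bounds how much lands in the context window, while the overlap keeps sentences at chunk boundaries from losing their surroundings:

```
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    # Illustrative fixed-size chunking with overlap; real systems often split
    # on sentence or section boundaries instead of raw character offsets.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back so adjacent chunks share context
    return chunks
```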

Embedding models are not created equal — some models excel at understanding text, others at images or tables, whilst some can handle multiple modalities effectively. Choosing the right one for your use case matters.
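For reference, the final architecture calls OpenAI’s text-embedding-3-small (it appears in the diagram further down); the model name is just a parameter, and swapping it changes retrieval quality, cost and vector dimensionality:

```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    # One embeddings call per batch of texts; returns one vector per input.
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```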

Keyword search still matters — especially for academic papers, where exact terminology (“sleep regression” vs. “sleep disruption”) cannot be captured by semantic similarity alone.

LLMs can be tamed — not only with carefully designed prompts and optimized context windows, but also with structured outputs. Structured outputs (such as data contracts established with Instructor) act as guard rails for the LLM: by defining exactly what the output should consist of (fields, data types, etc.), they minimize hallucinations and improve response quality.
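Here is a sketch of what such a data contract looks like with Instructor (the field names are hypothetical, not the ones from my system). The Pydantic model tells the LLM exactly which fields and types the answer must contain, and Instructor validates the response against it:

```
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class RAGAnswer(BaseModel):
    # Hypothetical contract: the LLM's output must parse into these fields.
    answer: str = Field(description="Answer grounded in the retrieved chunks")
    sources: list[str] = Field(description="Identifiers of the chunks used")
    confidence: float = Field(ge=0, le=1, description="Self-reported confidence")

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=RAGAnswer,  # the guard rail: non-conforming outputs are rejected and retried
    messages=[{"role": "user", "content": "What does the context say about sleep regression?"}],
)
print(result.answer, result.sources, result.confidence)
```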

I had a prototype of Dave Ebbelaar’s PostgreSQL approach running locally, using Pgvectorscale. The system performed well, but I knew it was incomplete; it lacked keyword search and a user interface.

Hybrid Search Breakthrough

While pure vector search was powerful, I noticed its limitations because the knowledge base I created was built from academic papers. It returned chunks that were contextually similar to my search queries but missed cases where specific wording mattered. In academia, precision is everything. For example, “sleep regression” is not the same as “sleep disruption”, and at the time the system could not reliably make that distinction.

This is where BM25 (Best Matching 25) came in. BM25 is an evolution of TF-IDF, which measures how important a word is as a function of how many times it appears in a document. While both methods measure document relevancy, BM25 improves on its predecessor in two key ways (a short code sketch follows the list):

  • Word frequency bias — TF-IDF assumes a high frequency of a term equates to a higher level of importance in the given document. BM25 addresses this bias by diminishing the returns a frequent term has on the overall ranking of that document. The 11th mention of a term matters less than the first 10.
  • Document length normalization — long documents naturally contain more words, so they are more likely than shorter documents to contain the searched keywords or terms. BM25 therefore penalizes documents that are longer than average.
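Here is the sketch promised above, using the rank_bm25 package (one possible implementation; the post doesn’t tie the system to a specific BM25 library). Note the exact-term behaviour: “regression” only scores documents that actually contain that token:

```
from rank_bm25 import BM25Okapi

corpus = [
    "Sleep regression often appears around four months of age.",
    "Sleep disruption can be caused by teething or illness.",
    "Consistent bedtime routines improve infant sleep quality.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query_tokens = "sleep regression".lower().split()
print(bm25.get_scores(query_tokens))  # highest score for the document containing "regression"
```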

The challenge wasn’t just adding a BM25 search engine to the system; it was combining it with the vector search engine in a meaningful way. My intuition, and first attempt, was a simple weighted average of the two engines’ scores (e.g. 70% vector, 30% BM25). The results felt inconsistent and were not what I had expected.

That’s when I discovered Reciprocal Rank Fusion (RRF). Unlike my initial approach, RRF looks only at rankings. It merges the top results from each search engine into a single, balanced list. Because it is agnostic to each engine’s scoring model, no normalization of document scores is required.
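RRF is simple enough to fit in a few lines: each engine contributes 1 / (k + rank) per document, so only positions matter and raw scores never need normalizing. The k = 60 constant is the value commonly used in practice; this is a generic sketch, not the exact code from my system:

```
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # result_lists: one ranked list of document ids per search engine.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: fuse a vector-search ranking with a BM25 ranking (ids are made up).
vector_hits = ["doc_3", "doc_1", "doc_7"]
bm25_hits = ["doc_1", "doc_9", "doc_3"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```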

The result was a hybrid retrieval system that combined semantic search with the precision of keyword search. Responses became richer and more precise, and the system suddenly seemed to return answers closer to how a human would search and interpret documents. It also became apparent that keyword search dominated the ranking.

Search Results Summary

The Architecture Shift: “Production-Ready”

With a functioning hybrid search in place, I started thinking bigger: what would a production-ready version look like? This is where I decided to experiment with AI-assisted code development in the form of “Vibe Coding.”

The “Vibe Coding” Experiment

My tool of choice — Claude Code.

Instead of using Claude Code just for writing code, I treated it, as many have advised, as a technical conversation partner. We would dive into brainstorming sessions; in the best case it looked like this: I would describe what I wanted and what I expected, and Claude would suggest approaches, provide recommendations, identify potential gaps or issues in the plan and help refine solutions.

The process of learning how to work with Claude was iterative and meant overcoming a significant learning curve. Several key learnings I found out the hard way were:

  • Ask Claude to be critical of your ideas and suggestions
  • Be critical of Claude’s suggestions and recommendations
  • Do not expect Claude to understand the ‘entire picture’; set consistent checkpoints to see what it understands
  • Use Plan Mode to devise a blueprint first, iterate second, agree third, then execute the plan, with a continuous feedback loop. Otherwise the result might not be what you expect or, worse, Claude might overwrite your existing code
  • Save your work at every step through version control
  • Validate the work: ensure comprehensive testing is carried out before deployment

For example, I was not yet familiar with Plan Mode, a built-in mode that prevents Claude from making any edits to the code. I would ask Claude something or make a statement and return to find a code repository with multiple edits and errors. This was a hard lesson, and one that taught me to always plan first, validate the plan, test often and use version control.

While there was a significant productivity boost, Vibe Coding came with its own challenges and bumps in the road: applications that would not start, running code that was overwritten, and recommendations that did not meet expectations. Agentic coding tools can accelerate development, but in my experience you need to stay engaged and critical of them; at the moment they are only as effective as the human at the keyboard.

FastAPI: Separating Concerns

The first major architectural change was to separate the application into a frontend and a backend. At the time, I had a RAG system running behind a Streamlit app to improve interactivity. Instead of a monolithic application, I would then have (a minimal sketch of the resulting contract follows the list):

  • Pydantic schemas enforcing data contracts and validation
  • Dedicated endpoints for all operations
  • Comprehensive error handling
  • Streamlit as a lightweight UI client that calls the backend service
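The sketch below shows the shape of that contract: a FastAPI endpoint whose request and response are Pydantic models. The endpoint path and field names are illustrative, not the project’s actual ones:

```
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="RAG backend")

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

class SearchResponse(BaseModel):
    answer: str
    sources: list[str]

@app.post("/search", response_model=SearchResponse)
def search(request: SearchRequest) -> SearchResponse:
    # In the real service this would call the hybrid retrieval and synthesis
    # pipeline; here a stub keeps the data contract visible end to end.
    return SearchResponse(answer=f"Stub answer for: {request.query}", sources=[])
```

The Streamlit frontend then only needs to POST to this endpoint and render the validated response, which is what keeps it a thin client.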

Redis: The Performance Game-Changer

I had become frustrated with the response time of queries I would rerun and asked Claude for help. It suggested deploying Redis, and that implementation delivered three key benefits (a short caching sketch follows the list):

  • Cost optimization: mitigate redundant embedding API calls
  • Latency reduction: sped up responses from seconds to milliseconds
  • Better user experience: response times of repeated queries were almost instantaneous
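The caching sketch mentioned above: hash the query, check Redis first, and only run the expensive embedding and LLM path on a miss. The key prefix and TTL are illustrative:

```
import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_answer(query: str, compute_answer, ttl_seconds: int = 3600):
    key = "rag:answer:" + hashlib.sha256(query.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)           # repeated query: milliseconds, no API calls
    answer = compute_answer(query)       # cache miss: run the full RAG pipeline
    cache.set(key, json.dumps(answer), ex=ttl_seconds)
    return answer
```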

The architecture now maintained separation of concerns — the frontend was a lightweight wrapper and the backend service handled all the logic and processing. Pgvectorscale handled document embeddings and the document metadata, whilst Redis managed query and response caching to enhance user experience. Ultimately the addition of these components improved maintainability, testing, error handling, stability and performance of the system.

Architecture Diagram

```
┌─────────────────────────────────────────────────────────┐
│                    Frontend Layer                       │
│  ┌─────────────────────────────────────────────────┐    │
│  │        Streamlit UI (frontend/)                 │    │
│  │  • User interface & query input                 │    │
│  │  • Search results visualization                 │    │
│  │  • System status dashboard                      │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────┬───────────────────────────────┘
                          │ HTTP Requests
                          ▼
┌─────────────────────────────────────────────────────────┐
│                     Backend Layer                       │
│  ┌─────────────────────────────────────────────────┐    │
│  │         FastAPI Service (backend/)              │    │
│  │  • API endpoints                                │    │
│  │  • Request validation                           │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────┬───────────────────────────────┘
                          │ Business Logic Calls
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Core Business Logic                   │
│  ┌─────────────────┐ ┌─────────────────┐ ┌───────────┐  │
│  │   Processors    │ │     Search      │ │ Services  │  │
│  │ • Document      │ │ • Vector        │ │ • Cache   │  │
│  │   processing    │ │ • BM25          │ │ • LLM     │  │
│  │ • PDF parsing   │ │ • RRF Hybrid    │ │• Synthesis│  │
│  │ • Chunking      │ │                 │ │ • Utils   │  │
│  └─────────────────┘ └─────────────────┘ └───────────┘  │
└─────────────────────────┬───────────────────────────────┘
                          │ Search Results + RRF Scores
                          ▼
┌───────────────────────────────────────────────────────────┐
│                  Infrastructure Layer                     │
│  ┌─────────────────────────────────┐ ┌─────────────────┐  │
│  │        PGvectorscale            │ │     Redis       │  │
│  │ • Vector operations             │ │ • Query cache   │  │
│  │ • Time-series partitioning      │ │ • Session mgmt  │  │
│  │ • Similarity search             │ │                 │  │
│  │ • Document storage              │ │                 │  │
│  └─────────────────────────────────┘ └─────────────────┘  │
└───────────────────────────────────────────────────────────┘
                          │ External API Calls
                          ▼
┌───────────────────────────────────────────────────────────┐
│                   External Services                       │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                   OpenAI API                        │  │
│  │ • text-embedding-3-small (embeddings)               │  │
│  │ • gpt-4o (LLM responses)                            │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
```

Key Learnings

Here are some things I’ve learned along the way:

PostgreSQL can double as a vector DB. With Timescale’s solution, you might not need a specialized store after all.

Indexing matters. Efficient ANN indexes like DiskANN keep searches fast and scalable.

Hybrid search is trickier than it looks. Fusing results from different search engines requires more than intuition.

Plan for resilience. Design for graceful failure, e.g. keep running even if Redis is down (a tiny sketch below).
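One way to get that graceful failure, sketched under the same assumptions as the caching example earlier: treat the cache as optional and fall back to a cache miss whenever Redis is unreachable.

```
import redis

def safe_cache_get(cache: redis.Redis, key: str):
    try:
        return cache.get(key)
    except redis.exceptions.ConnectionError:
        return None  # cache unavailable: behave as a miss instead of crashing
```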

“Tame” AI. Limit LLM calls and let AI solve single components within the application rather than letting it figure everything out on its own; that could be dangerous and expensive.

AI-assisted coding accelerates development, but it doesn’t replace you. Treat AI as a partner, not a replacement.

The Verdict

Did I replicate Palantir’s RAG system? No, that was never the goal. My aim was to understand how these systems really work, to learn how to tame LLMs for better outputs, and to practice building agents with Pythonic frameworks. Along the way, I realized that AI applications today should be treated as software first. Use solid engineering practices, and call on LLMs only when they add value; ideally for focused tasks where code alone cannot deliver consistent results.

Palantir’s Foundry and Ontology reflect decades of work and polish. Platforms like that are autobahns: they get you to your destination quickly and smoothly. Building from scratch is the opposite: it’s hiking a trail. It takes muscle and it is messy, but it rewards you with a deeper connection to the terrain. And once you’ve walked the trail, you come away with skills and insights that make you more capable, whatever route you take next.