Retrieval (RAG)

Retrieval Systems

Retrieval that puts the right evidence in front of a model — full-text shortlists, vector search where it earns its keep, and hard filters that keep answers grounded.

Query → Shortlist → Grounded answer

We build retrieval systems: the layer that decides which evidence a language model actually reads before it answers. Good retrieval-augmented generation is not a vector database and a prompt; it is a pipeline of shortlisting, filtering, and ranking that puts the right passages in front of the model and keeps the wrong ones out. We design enterprise RAG that knows when semantic search helps, when a full-text index is sharper, and when the honest answer is to skip retrieval and pass the whole document.

When vector search earns its place — and when it doesn't

The default assumption is that every RAG system needs a vector database. Often it doesn't. When your corpus is bounded (a few standards, a policy set, a product manual), the strongest design inlines the full text and leans on prompt caching, so the bulk of the context cost is paid once per session rather than again on every query. Vector search pays off when the corpus is large, the vocabulary is messy, and users phrase questions nothing like the source. We start by measuring the corpus, not by reaching for an index. The result is a retrieval layer matched to the data: semantic search where meaning drifts from wording, exact-match full-text where the language is already precise, and no index at all where a model can simply read everything.
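As a rough illustration of "measure the corpus first", the decision can be sketched as a small function over two measured numbers. Everything named here is an assumption made for the example: the `CorpusStats` fields, the `choose_strategy` helper, and both thresholds are hypothetical and would be tuned per model and per project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    total_tokens: int          # size of the whole corpus, in model tokens
    vocabulary_overlap: float  # 0..1 share of query terms that also occur in the source


def choose_strategy(
    stats: CorpusStats,
    context_budget_tokens: int = 150_000,  # assumed budget, not a real model limit
    min_overlap: float = 0.6,              # assumed cut-off for "vocabulary is clean"
) -> str:
    """Return 'inline', 'full_text', or 'vector' for a measured corpus."""
    if stats.total_tokens <= context_budget_tokens:
        # Bounded corpus: let the model read everything and rely on prompt caching.
        return "inline"
    if stats.vocabulary_overlap >= min_overlap:
        # Users already speak the source's language: keyword matching is high-signal.
        return "full_text"
    # Large corpus with phrasing that drifts from the source wording.
    return "vector"
```

In practice the overlap figure would come from a sample of real user queries, which is why the measurement has to happen before any index is chosen.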

Full-text shortlists with a model reading the long form

A pattern that consistently outperforms naive vector search: use a fast full-text index to narrow thousands of passages down to a handful of candidates, then let the language model read those candidates in full and decide. The shortlist is cheap and high-recall; the model does the precise judgment on a small set. We build this with generated full-text columns and GIN indexes inside the database you already run — no separate vector store, no extra infrastructure. When the upstream data is enriched into a consistent vocabulary before retrieval, keyword matching becomes high-signal, and the model bridges any remaining semantic gap by reading the shortlisted text rather than trusting a similarity score it can't explain.
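The shape of the shortlist-then-read pattern can be sketched in a few lines. This is a toy stand-in, not the production design: an in-memory inverted index plays the role of the database's full-text column and GIN index, and the `FullTextIndex` class, the `build_prompt` helper, and the sample passages are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (no stemming or stop words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FullTextIndex:
    """Toy inverted index standing in for a database full-text index."""

    def __init__(self) -> None:
        self.postings = defaultdict(set)  # term -> ids of passages containing it
        self.passages = {}                # passage id -> full text

    def add(self, passage_id: int, text: str) -> None:
        self.passages[passage_id] = text
        for term in set(tokenize(text)):
            self.postings[term].add(passage_id)

    def shortlist(self, query: str, limit: int = 5) -> list[int]:
        """Cheap, recall-oriented step: rank passages by matching query terms."""
        hits = Counter()
        for term in set(tokenize(query)):
            for passage_id in self.postings.get(term, ()):
                hits[passage_id] += 1
        ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
        return [passage_id for passage_id, _ in ranked[:limit]]


def build_prompt(index: FullTextIndex, query: str, limit: int = 5) -> str:
    """Hand the shortlisted passages to the model in full, so it makes the final call."""
    blocks = [f"[{pid}] {index.passages[pid]}" for pid in index.shortlist(query, limit)]
    return (
        "Answer using only the passages below and cite the passage id.\n\n"
        + "\n".join(blocks)
        + f"\n\nQuestion: {query}"
    )


index = FullTextIndex()
index.add(1, "Invoices must be retained for seven years after the fiscal year ends.")
index.add(2, "Employees may work remotely two days per week.")
index.add(3, "The retention period for payroll records is five years.")
print(index.shortlist("retention period for invoices", limit=2))  # [3, 1]
```

Note that the keyword score puts the payroll passage first even though the invoice passage is the better answer. That is the point of the second step: the shortlist only has to include the right passage, and the model settles precision by reading both.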

Hard filters before similarity — the part that keeps RAG correct

Most retrieval failures are not ranking failures; they are filtering failures. The system retrieves a passage that is semantically close but contextually wrong — the right clause from the wrong contract, the right rule from a version that no longer applies, another tenant's record entirely. We treat filtering as a first-class layer: tenant isolation, effective-date and jurisdiction routing, document-type and status constraints, all applied before vector search or full-text scoring ever runs. Similarity decides ranking within an already-correct candidate set, never membership in it. This is what separates a demo from enterprise RAG — the retrieval layer enforces who can see what and which version is authoritative, so the model never reasons over evidence it should never have seen.
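A minimal sketch of "filter decides membership, similarity decides order", assuming passages carry their own tenant, validity window, and status. The `Passage` fields, the `"published"` status value, and the `retrieve` helper are hypothetical names chosen for the example, and the two-dimensional embeddings are only there to keep it readable.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt
from typing import Optional


@dataclass(frozen=True)
class Passage:
    passage_id: str
    tenant_id: str
    effective_from: date
    effective_to: Optional[date]  # None means the version is still in force
    status: str
    embedding: tuple


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_eligible(p: Passage, tenant_id: str, as_of: date) -> bool:
    """Hard constraints: a passage that fails any of these is never a candidate."""
    in_force = p.effective_from <= as_of and (p.effective_to is None or as_of < p.effective_to)
    return p.tenant_id == tenant_id and p.status == "published" and in_force


def retrieve(passages, query_embedding, tenant_id: str, as_of: date, limit: int = 3):
    # Membership is settled here, before any score is computed.
    candidates = [p for p in passages if is_eligible(p, tenant_id, as_of)]
    # Similarity only orders the passages that survived the filters.
    candidates.sort(key=lambda p: cosine(p.embedding, query_embedding), reverse=True)
    return candidates[:limit]


corpus = [
    Passage("acme-v1", "acme", date(2020, 1, 1), date(2023, 1, 1), "published", (1.0, 0.0)),
    Passage("acme-v2", "acme", date(2023, 1, 1), None, "published", (0.8, 0.6)),
    Passage("other-v1", "other", date(2020, 1, 1), None, "published", (1.0, 0.0)),
    Passage("acme-draft", "acme", date(2023, 1, 1), None, "draft", (1.0, 0.0)),
]
results = retrieve(corpus, (1.0, 0.0), tenant_id="acme", as_of=date(2024, 6, 1))
```

Three of the four sample passages match the query embedding exactly, yet only `acme-v2` is returned: the others belong to another tenant, have expired, or are unpublished. In a database-backed system the same constraints would typically live in the query's WHERE clause rather than in application code.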

Retrieving structure, not just text

Retrieval is not only about prose. Some of the most useful evidence is structured — call chains through a codebase, rows from a schema, the symbols and routes that make up a feature. We build retrieval surfaces over structured data so an agent can ask precise questions: list the routes under this prefix, fetch this chain root-to-terminal, read this symbol's body, get this table's columns. Exposed through a tool interface, this lets a model explore a graph of facts instead of guessing from a blob of concatenated text. The same discipline applies — narrow first, then read the exact thing — but the unit of retrieval is a typed record with provenance, so every claim the model makes traces back to a specific identifier in the source.
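To make the idea of a typed retrieval surface concrete, here is a small sketch of the kind of functions that could sit behind a tool interface. The record types, the `CodeFacts` class, its method names, and the sample data are all hypothetical; a real system would populate them from a parsed codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    route_id: str
    path: str
    handler_symbol: str
    source_file: str  # provenance: where this fact was extracted from


@dataclass(frozen=True)
class Symbol:
    symbol_id: str
    body: str
    calls: tuple      # ids of the symbols this one calls
    source_file: str


class CodeFacts:
    """Narrow, typed lookups that an agent could call as tools."""

    def __init__(self, routes, symbols) -> None:
        self._routes = list(routes)
        self._symbols = {s.symbol_id: s for s in symbols}

    def list_routes(self, prefix: str) -> list:
        return [r for r in self._routes if r.path.startswith(prefix)]

    def read_symbol(self, symbol_id: str) -> Symbol:
        return self._symbols[symbol_id]

    def call_chains(self, root_id: str) -> list:
        """Return every root-to-terminal path of symbol ids, skipping cycles."""
        chains = []

        def walk(symbol_id: str, path: list) -> None:
            path = path + [symbol_id]
            callees = [
                c for c in self._symbols[symbol_id].calls
                if c in self._symbols and c not in path
            ]
            if not callees:
                chains.append(path)
                return
            for callee in callees:
                walk(callee, path)

        walk(root_id, [])
        return chains


facts = CodeFacts(
    routes=[
        Route("r1", "/billing/invoices", "list_invoices", "routes/billing.py"),
        Route("r2", "/users/me", "get_me", "routes/users.py"),
    ],
    symbols=[
        Symbol("list_invoices", "def list_invoices(): ...", ("invoice_service",), "routes/billing.py"),
        Symbol("invoice_service", "def invoice_service(): ...", ("invoice_repo", "audit_log"), "services/invoice.py"),
        Symbol("invoice_repo", "def invoice_repo(): ...", (), "repos/invoice.py"),
        Symbol("audit_log", "def audit_log(): ...", (), "services/audit.py"),
    ],
)
```

Because every result is a record with an identifier and a source file, an answer built from these calls can cite exactly which route or symbol it relied on.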

Cost, caching, and graduated retrieval

Retrieval design is a cost problem as much as a quality one. We use prompt caching so stable context (a standard, a schema, a system prompt) is processed once and reused across queries, turning a large fixed corpus into a near-flat per-session cost. We design retrieval as a graduated path rather than a fixed architecture: start with full-text shortlists, add semantic vector search only where measurement shows it lifts quality, and add a reranking pass last, where precision genuinely demands it. Each step is justified by evals on real queries, not by what the architecture diagram is supposed to contain. The cheapest retrieval that meets the quality bar is the right one.
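One way to picture the graduated path is as a loop over pipeline stages ordered from cheapest to most expensive, stopping at the first one that clears the quality bar on an eval set. This is a sketch under assumptions: recall@k stands in for whatever metric a project actually uses, and the `recall_at_k` and `choose_pipeline` helpers and the stage names in the usage note are hypothetical.

```python
from typing import Callable, Sequence

Retriever = Callable[[str], list]  # query -> ranked list of passage ids


def recall_at_k(retriever: Retriever, eval_set: Sequence[tuple], k: int = 5) -> float:
    """Share of (query, relevant_id) pairs whose relevant id appears in the top k."""
    hits = sum(1 for query, relevant_id in eval_set if relevant_id in retriever(query)[:k])
    return hits / len(eval_set)


def choose_pipeline(
    stages: Sequence[tuple],  # (name, retriever) pairs, ordered cheapest first
    eval_set: Sequence[tuple],
    quality_bar: float,
    k: int = 5,
) -> tuple:
    """Return (name, score) of the cheapest stage meeting the bar.

    If no stage meets it, return the best-scoring stage so the gap is visible.
    """
    best_name, best_score = stages[0][0], -1.0
    for name, retriever in stages:
        score = recall_at_k(retriever, eval_set, k)
        if score >= quality_bar:
            return name, score
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

Called with stages such as `[("full_text", ...), ("full_text+vector", ...), ("full_text+vector+rerank", ...)]`, the function never evaluates a more expensive stage once a cheaper one is good enough, which mirrors the rule that each added step has to be justified by measurement.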

What this includes
  • Full-text search with generated tsvector columns and GIN indexes for fast, high-recall shortlisting
  • Vector search and semantic search where the corpus is large and the vocabulary diverges from user phrasing
  • Hard pre-filters for tenant isolation, effective dates, version, and document type — applied before scoring
  • Prompt caching for bounded corpora so large context is paid for once and reused per session
  • Retrieval over structured data — chains, symbols, schemas, routes — exposed through a typed tool interface
  • Eval-driven graduated pipelines: shortlist first, then semantic search and reranking only where they measurably help
What you get
  • Answers grounded in the right evidence, with every passage traceable back to its source record
  • Retrieval cost that stays close to flat per session as query volume grows, instead of re-paying for the full context on every call
  • A retrieval layer matched to your data — no vector database you didn't need, no missing index you did
Where it fits

Use cases

Standards and policy lookup

A bounded regulatory corpus where exact wording matters. Full-text narrows to a few candidate paragraphs; the model reads them verbatim and cites the applicable one. Effective-date and jurisdiction filters ensure only the version in force is ever retrieved.

Multi-tenant document search

Semantic search across customer contracts and uploads where tenant isolation is non-negotiable. Hard filters scope every query to one tenant before similarity ranking runs, so no embedding can ever surface another customer's text.

Code-aware agents

An agent that reasons about a real codebase by retrieving call chains, symbol bodies, and route definitions through a tool interface — narrowing to the relevant cluster first, then reading exact records so each claim ties back to a specific identifier.

FAQ

Common questions

Not always. Vector search earns its place when the corpus is large and users phrase questions unlike the source. For bounded corpora, a full-text index plus prompt caching is often sharper and cheaper. We measure your data first and recommend the retrieval that fits it — sometimes that includes a vector database, sometimes it doesn't.

Full-text matches words and is precise when the vocabulary is already clean; semantic search matches meaning and helps when phrasing drifts from the source. We frequently combine them — a full-text shortlist for recall, then the model reads candidates in full. Vector search is added where evals show it lifts quality, not by default.

By treating filtering as a first-class layer. Tenant isolation, effective-date, version, and document-type constraints run before any similarity scoring, so the model only ever sees evidence it's allowed to see and the version that's authoritative. Similarity ranks within an already-correct set — it never decides membership.

Prompt caching pays for stable context once and reuses it across queries, turning a large fixed corpus into a flat per-session cost. We design retrieval as a graduated path — shortlist first, add semantic vector search and reranking only where measurement justifies them — so you run the cheapest pipeline that meets the quality bar.

Building something that needs this?

Tell us what you're working on. The first call is always free.
