Retrieval Systems
Retrieval that puts the right evidence in front of a model — full-text shortlists, vector search where it earns its keep, and hard filters that keep answers grounded.
We build retrieval systems: the layer that decides which evidence a language model actually reads before it answers. Good retrieval-augmented generation is not a vector database and a prompt; it is a pipeline of shortlisting, filtering, and ranking that puts the right passages in front of the model and keeps the wrong ones out. We design enterprise RAG that knows when semantic search helps, when a full-text index is sharper, and when the honest answer is to skip retrieval and pass the whole document.
When vector search earns its place — and when it doesn't
The default assumption is that every RAG system needs a vector database. Often it doesn't. When your corpus is bounded (a few standards, a policy set, a product manual), the strongest design inlines the full text and leans on prompt caching, so most of the cost is paid when the context is first cached rather than again in full on every query. Vector search pays off when the corpus is large, the vocabulary is messy, and users phrase questions nothing like the source. We start by measuring the corpus, not by reaching for an index. The result is a retrieval layer matched to the data: semantic search where meaning drifts from wording, exact-match full-text where the language is already precise, and no index at all where a model can simply read everything.
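The decision described above can be sketched as a small function over two measurements. This is a minimal illustration, not a fixed rule: the names `CorpusStats`, `choose_strategy`, and `term_overlap` are hypothetical, and the default thresholds are placeholder assumptions that would need tuning against a real corpus and model context window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    total_tokens: int     # size of the whole corpus, in model tokens
    query_overlap: float  # share of query terms that also occur in the corpus (0..1)


def term_overlap(queries: list[str], corpus_vocab: set[str]) -> float:
    """Fraction of whitespace-separated query terms found in the corpus vocabulary."""
    terms = [t for q in queries for t in q.lower().split()]
    if not terms:
        return 0.0
    return sum(t in corpus_vocab for t in terms) / len(terms)


def choose_strategy(stats: CorpusStats,
                    inline_budget_tokens: int = 150_000,  # assumed budget, not a recommendation
                    overlap_floor: float = 0.6) -> str:   # assumed cut-off, not a recommendation
    """Pick the simplest retrieval design the measurements support."""
    if stats.total_tokens <= inline_budget_tokens:
        return "inline_whole_corpus"  # bounded corpus: pass everything, rely on caching
    if stats.query_overlap >= overlap_floor:
        return "full_text_index"      # users already use the corpus's wording
    return "vector_search"            # large corpus and wording drifts from the source
```

The ordering of the checks mirrors the argument: the no-index option is tried first, and vector search is the fallback only when both cheaper options fail their test.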
Full-text shortlists with a model reading the long form
A pattern that consistently outperforms naive vector search: use a fast full-text index to narrow thousands of passages down to a handful of candidates, then let the language model read those candidates in full and decide. The shortlist is cheap and high-recall; the model does the precise judgment on a small set. We build this with generated full-text columns and GIN indexes inside the database you already run — no separate vector store, no extra infrastructure. When the upstream data is enriched into a consistent vocabulary before retrieval, keyword matching becomes high-signal, and the model bridges any remaining semantic gap by reading the shortlisted text rather than trusting a similarity score it can't explain.
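As a rough sketch of this pattern, the Python below holds the PostgreSQL statements for a generated tsvector column, a GIN index, and a ranked shortlist query, plus a helper that lays the shortlisted passages out in full for the model. The table `passages`, its columns `id` and `body`, the index name, and both helper functions are hypothetical; the SQL assumes PostgreSQL 12 or later and a driver that accepts named `%(name)s` parameters, such as psycopg.

```python
# One-time schema setup. Table, column, and index names are hypothetical.
SCHEMA_SQL = """
ALTER TABLE passages
    ADD COLUMN tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX passages_tsv_idx ON passages USING GIN (tsv);
"""

# Shortlist query. websearch_to_tsquery ANDs plain terms by default, so a
# recall-oriented deployment may prefer to build an OR query instead.
SHORTLIST_SQL = """
SELECT id, body, ts_rank(tsv, query) AS rank
FROM passages, websearch_to_tsquery('english', %(question)s) AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT %(limit)s;
"""


def shortlist_params(question: str, limit: int = 8) -> dict:
    """Parameters for SHORTLIST_SQL; the question is never spliced into the SQL text."""
    return {"question": question, "limit": limit}


def build_reading_prompt(question: str, candidates: list[tuple[int, str]]) -> str:
    """Lay out the shortlisted passages in full so the model can judge them."""
    blocks = [f"[passage {pid}]\n{body}" for pid, body in candidates]
    return (
        "Answer using only the passages below and cite the passage id you relied on.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )
```

With a database cursor, the flow would be `cursor.execute(SHORTLIST_SQL, shortlist_params(question))`, then passing the returned `(id, body)` pairs to `build_reading_prompt`. The rank only orders the shortlist; the final judgment is left to the model reading the text.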
Hard filters before similarity — the part that keeps RAG correct
Most retrieval failures are not ranking failures; they are filtering failures. The system retrieves a passage that is semantically close but contextually wrong — the right clause from the wrong contract, the right rule from a version that no longer applies, another tenant's record entirely. We treat filtering as a first-class layer: tenant isolation, effective-date and jurisdiction routing, document-type and status constraints, all applied before vector search or full-text scoring ever runs. Similarity decides ranking within an already-correct candidate set, never membership in it. This is what separates a demo from enterprise RAG — the retrieval layer enforces who can see what and which version is authoritative, so the model never reasons over evidence it should never have seen.
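The ordering can be illustrated with a small in-memory sketch: hard filters decide membership, and similarity only orders what is left. The `Passage` fields, the `"active"` status value, and the `retrieve` function are assumptions made for the example; in a real system the same predicates would typically run as a WHERE clause or a metadata filter in the store itself.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt
from typing import Optional


@dataclass(frozen=True)
class Passage:
    id: str
    tenant_id: str
    status: str                    # hypothetical values: "active", "superseded"
    effective_from: date
    effective_to: Optional[date]   # None means still in force
    embedding: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(passages: list[Passage], query_embedding: tuple[float, ...], *,
             tenant_id: str, as_of: date, top_k: int = 5) -> list[Passage]:
    # Membership: only hard filters decide which passages are eligible.
    eligible = [
        p for p in passages
        if p.tenant_id == tenant_id
        and p.status == "active"
        and p.effective_from <= as_of
        and (p.effective_to is None or as_of < p.effective_to)
    ]
    # Ranking: similarity orders the eligible set and never adds to it.
    ranked = sorted(eligible, key=lambda p: cosine(p.embedding, query_embedding), reverse=True)
    return ranked[:top_k]
```

Because the filter runs first, a passage from another tenant or a superseded version cannot be returned no matter how close its embedding is to the query.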
Retrieving structure, not just text
Retrieval is not only about prose. Some of the most useful evidence is structured — call chains through a codebase, rows from a schema, the symbols and routes that make up a feature. We build retrieval surfaces over structured data so an agent can ask precise questions: list the routes under this prefix, fetch this chain root-to-terminal, read this symbol's body, get this table's columns. Exposed through a tool interface, this lets a model explore a graph of facts instead of guessing from a blob of concatenated text. The same discipline applies — narrow first, then read the exact thing — but the unit of retrieval is a typed record with provenance, so every claim the model makes traces back to a specific identifier in the source.
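A minimal sketch of such a retrieval surface, assuming a toy in-memory index: the `Route`, `Symbol`, and `CodeIndex` names and their fields are hypothetical, and a real implementation would sit on top of a parsed codebase or graph store. The point it shows is that each tool call returns typed records carrying an identifier and a source location the model can cite.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Route:
    id: str        # stable identifier the model can cite
    path: str
    handler: str   # id of the symbol that handles this route
    source: str    # provenance, e.g. file and line


@dataclass(frozen=True)
class Symbol:
    id: str
    body: str
    source: str


class CodeIndex:
    """Narrow first (list_routes), then read the exact thing (read_symbol)."""

    def __init__(self, routes: Iterable[Route], symbols: Iterable[Symbol]):
        self._routes = list(routes)
        self._symbols = {s.id: s for s in symbols}

    def list_routes(self, prefix: str) -> list[Route]:
        matches = (r for r in self._routes if r.path.startswith(prefix))
        return sorted(matches, key=lambda r: r.path)

    def read_symbol(self, symbol_id: str) -> Symbol:
        # Raises KeyError for unknown ids, so a wrong guess fails loudly.
        return self._symbols[symbol_id]
```

An agent would call `list_routes("/billing")` to narrow to the relevant routes, then `read_symbol(route.handler)` to read the exact handler body, with each record's `id` and `source` available for citation.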
Cost, caching, and graduated retrieval
Retrieval design is a cost problem as much as a quality one. We use prompt caching so stable context (a standard, a schema, a system prompt) is processed once and then reused across queries at a reduced rate, which keeps the cost of a large fixed corpus close to flat per session. We design retrieval as a graduated path rather than a fixed architecture: start with full-text shortlists, add semantic vector search only where measurement shows it lifts quality, and add a reranking pass last, where precision genuinely demands it. Each step is justified by evals on real queries, not by what the architecture diagram is supposed to contain. The cheapest retrieval that meets the quality bar is the right one.
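The selection rule in the last sentence can be sketched in a few lines. `PipelineResult` and `cheapest_passing` are hypothetical names, and the quality and cost figures are assumed to come from your own evals on real queries; nothing here prescribes how they are measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineResult:
    name: str              # e.g. "full-text shortlist", "shortlist + vector search"
    quality: float         # eval score on real queries, 0..1
    cost_per_query: float  # any consistent cost unit


def cheapest_passing(results: list[PipelineResult],
                     quality_bar: float) -> Optional[PipelineResult]:
    """Return the lowest-cost pipeline that meets the bar, or None if none does."""
    passing = [r for r in results if r.quality >= quality_bar]
    return min(passing, key=lambda r: r.cost_per_query, default=None)
```

A result of `None` is informative in itself: it means no evaluated stage meets the bar yet, which is the signal to try the next step on the graduated path.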
- Full-text search with generated tsvector columns and GIN indexes for fast, high-recall shortlisting
- Vector search and semantic search where the corpus is large and the vocabulary diverges from user phrasing
- Hard pre-filters for tenant isolation, effective dates, version, and document type — applied before scoring
- Prompt caching for bounded corpora so large context is paid for once and reused per session
- Retrieval over structured data — chains, symbols, schemas, routes — exposed through a typed tool interface
- Eval-driven graduated pipelines: full-text shortlist first, then semantic search and a reranking pass only where they measurably help
- Answers grounded in the right evidence, with every passage traceable back to its source record
- Retrieval cost that stays flat as query volume grows, instead of scaling with every call
- A retrieval layer matched to your data — no vector database you didn't need, no missing index you did
Use cases
A bounded regulatory corpus where exact wording matters. Full-text narrows to a few candidate paragraphs; the model reads them verbatim and cites the applicable one. Effective-date and jurisdiction filters ensure only the version in force is ever retrieved.
Semantic search across customer contracts and uploads where tenant isolation is non-negotiable. Hard filters scope every query to one tenant before similarity ranking runs, so no embedding can ever surface another customer's text.
An agent that reasons about a real codebase by retrieving call chains, symbol bodies, and route definitions through a tool interface — narrowing to the relevant cluster first, then reading exact records so each claim ties back to a specific identifier.