When BM25 Beats Your Embedding Model: Hybrid Retrieval in the Wild

The default story for enterprise RAG goes like this: chunk your documents, embed them with text-embedding-3-large or bge-large or whatever this quarter’s leader is, store the vectors, retrieve by cosine similarity, done. The benchmarks support it. The blog posts confirm it. The vendor demos are unambiguous.

In production, on enterprise corpora, this approach loses to a thoughtful sparse-plus-rerank pipeline more often than the literature would suggest. Here are three cases from the last year, and what the fix looked like in each.

Case 1: A bank’s policy library

The corpus was the bank’s internal policy and procedure manuals — 8,000 documents, 200,000 chunks, heavy on regulatory citations, internal codes, product names, and circulars. The retrieval task was answering operations-officer questions like “what is the current limit on cash deposits without PAN under circular 17/2023?”

Pure dense retrieval missed the right chunk 31% of the time at k=5. The failures clustered around queries with specific identifiers — circular numbers, product codes, regulator-defined terms. The embedding model treated “circular 17/2023” as roughly similar to “circular 2023/17” and several other circulars from the same period.

Fix. BM25 over the same chunks, which matches identifiers exactly rather than approximately, recovered the right chunk in 96% of the cases that dense retrieval had missed. We deployed hybrid retrieval (reciprocal rank fusion, RRF, of the dense top-50 and sparse top-50) with a cross-encoder reranker on top. End-to-end retrieval recall at k=5 went from 69% to 91%.
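For readers who have not implemented the fusion step, here is a minimal sketch of reciprocal rank fusion over two ranked lists of chunk ids. The function name rrf_fuse, the smoothing constant k=60 (a commonly used default, not a value taken from this deployment), and the toy chunk ids are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=50):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Toy rank lists standing in for the dense and sparse top-50.
dense_top = ["c7", "c2", "c9"]
sparse_top = ["c4", "c7", "c2"]
fused = rrf_fuse([dense_top, sparse_top])
print(fused)
```

RRF only looks at ranks, so the dense and sparse scores never need to be put on a common scale, which is a large part of why it is a convenient default.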

The lesson here is mundane: when your corpus contains identifiers that humans care about, lexical matching is not optional. Embeddings smooth over the exact thing the user is asking for.
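Lexical matching only helps if the tokenizer keeps identifiers intact. The sketch below is a small, self-contained BM25 scorer with a tokenizer that treats strings such as 17/2023 as a single token. The regular expression, the k1 and b values, and the three sample chunks are assumptions chosen for illustration; a production system would more likely use a search engine's built-in BM25 with a custom analyzer.

```python
import math
import re
from collections import Counter

# Keep identifiers such as "17/2023" or "pxr-450" as one token (assumed pattern).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[/\-.][a-z0-9]+)*")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in tokens]
        df = Counter(term for t in tokens for term in set(t))
        n = len(docs)
        # Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            length_norm = 1 - self.b + self.b * self.doc_len[i] / self.avgdl
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        # Document indices, best match first.
        return sorted(range(len(self.tf)), key=lambda i: -self.score(query, i))


chunks = [
    "Circular 17/2023 sets the limit on cash deposits without PAN.",
    "Circular 2023/17 covers reporting of cash deposits.",
    "Circular 18/2023 revises KYC documentation.",
]
index = BM25(chunks)
ranking = index.rank("cash deposits limit under circular 17/2023")
print(ranking)
```

With a tokenizer that splits on the slash, 17/2023 and 2023/17 would reduce to the same two tokens and the identifier would lose most of its discriminating power.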

Case 2: A manufacturer’s service manuals

Corpus: equipment service manuals across three product lines, multiple model years. Chunks contained dense technical procedures with part numbers, torque specs, sequence steps. The retrieval task was answering field-technician queries — “what is the disassembly sequence for the PXR-450 pump head?”

Dense retrieval again struggled with the identifier: PXR-450 is lexically distinctive, but to the embedding model it was nearly indistinguishable from PXR-455 and PXR-440, both of which appeared in the corpus with overlapping procedures. The result was answers stitched from the wrong model’s manual.

Fix. Metadata filtering before retrieval. Every chunk carried a structured model_id field; the query was parsed to extract the identifier; retrieval filtered to chunks matching the model before either dense or sparse retrieval ran. Recall went from 58% to 94%.
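The filtering step can be sketched in a few lines. Only the model_id field comes from the description above; the Chunk type, the identifier pattern, and the fallback to the unfiltered corpus when no identifier is found are assumptions for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical identifier shape: 2-4 letters, a hyphen, 3 digits (e.g. PXR-450).
MODEL_ID_RE = re.compile(r"\b[A-Z]{2,4}-\d{3}\b")


@dataclass
class Chunk:
    chunk_id: str
    model_id: str
    text: str


def extract_model_id(query):
    """Return the first model identifier found in the query, or None."""
    match = MODEL_ID_RE.search(query.upper())
    return match.group(0) if match else None


def filter_chunks(chunks, query):
    """Restrict the candidate set to the model named in the query."""
    model_id = extract_model_id(query)
    if model_id is None:
        # Assumed fallback: no identifier in the query, search everything.
        return list(chunks)
    return [c for c in chunks if c.model_id == model_id]


corpus = [
    Chunk("a1", "PXR-450", "Pump head disassembly: remove the four cap screws."),
    Chunk("b1", "PXR-455", "Pump head disassembly: remove the six cap screws."),
    Chunk("c1", "PXR-440", "Pump head disassembly: release the clamp ring."),
]
query = "what is the disassembly sequence for the PXR-450 pump head?"
candidates = filter_chunks(corpus, query)
print([c.chunk_id for c in candidates])
```

Dense and sparse retrieval then run over candidates only, so a chunk from the wrong model cannot be retrieved no matter how similar its text is.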

The lesson: when entity identity matters, encode it as metadata, filter on it explicitly, and stop hoping the embedding model has learned it. It has not learned what you think it has learned.

Case 3: A logistics firm’s customs handbook

Corpus: customs tariff schedules, ruling letters, and the firm’s internal HS-code classification guidance — 50,000 chunks, heavy on numeric codes, product descriptions, exceptions and overrides. The retrieval task was answering a classification analyst’s queries on edge-case HS codes.

This one is the most interesting because pure BM25 also failed. The analysts asked questions in product language (“frozen pre-cooked chicken nuggets with breading”), and the relevant rulings were written in regulatory language (“prepared poultry products, breaded, of subheading 1602.32.10”). BM25 missed because the surface form was wrong. Dense missed because the model did not know the regulatory ontology well enough.

Fix. Query rewriting. A small upstream model translated the analyst’s query into the regulatory register, then both queries (original and rewritten) were used for hybrid retrieval, with the union reranked. Recall was 47% with BM25 alone and 51% with dense alone; rewrite + hybrid + rerank brought it to 88%.
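The control flow of that fix is roughly the following. This is a sketch under stated assumptions: rewrite, the retrievers, and rerank are passed in as plain callables, the toy stand-ins below replace the real rewriting model, retrievers, and cross-encoder, candidates are pooled as a simple de-duplicated union rather than fused, and reranking against the original query (rather than the rewritten one) is a design choice the case description does not specify.

```python
from typing import Callable, List


def retrieve_with_rewrite(
    query: str,
    rewrite: Callable[[str], str],
    retrievers: List[Callable[[str], List[str]]],
    rerank: Callable[[str, List[str]], List[str]],
    k: int = 5,
) -> List[str]:
    """Retrieve with the original and rewritten query, then rerank the union."""
    queries = [query]
    rewritten = rewrite(query)
    if rewritten and rewritten != query:
        queries.append(rewritten)

    # Union of candidates across both queries and all retrievers, de-duplicated.
    seen, candidates = set(), []
    for q in queries:
        for retriever in retrievers:
            for chunk_id in retriever(q):
                if chunk_id not in seen:
                    seen.add(chunk_id)
                    candidates.append(chunk_id)

    # Assumption: rerank against what the analyst actually asked.
    return rerank(query, candidates)[:k]


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_rewrite(q):
    return "prepared poultry products, breaded"


def toy_sparse(q):
    return ["r1"] if "poultry" in q else []


def toy_dense(q):
    return ["r2", "r1"] if "poultry" in q else ["r3"]


def toy_rerank(q, cands):
    return sorted(cands)


result = retrieve_with_rewrite(
    "frozen pre-cooked chicken nuggets with breading",
    toy_rewrite,
    [toy_sparse, toy_dense],
    toy_rerank,
)
print(result)
```

Keeping the original query in the pool means a bad rewrite costs some candidate slots but cannot remove chunks the original query would have found.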

The lesson: vocabulary mismatch is a retrieval problem, not a generation problem. Solve it at the query layer, not by hoping the LLM compensates in the final answer.

What we generally ship now

Our default enterprise RAG configuration looks like this, and we deviate only with cause:

Hybrid retrieval. BM25 + dense, RRF fusion. Both rank lists are ~50 deep before fusion.
Metadata filters. Mandatory where entity identity matters (product, jurisdiction, customer, time window).
Query rewriting. A small model expands or translates the query when the corpus vocabulary differs from user vocabulary. Cheap and often decisive.
Cross-encoder rerank. Reranks the fused top-50 down to k=5 for the generator. Adds latency but usually pays back in answer quality.
Independent retrieval evaluation. Recall@5, MRR, faithfulness scored against a labelled set. Run on every change. Hold the line.
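
The two retrieval metrics in the last item are simple enough to compute without a framework. The sketch below assumes a labelled set of (retrieved ids, relevant ids) pairs, and it reads recall at k as the fraction of queries with at least one relevant chunk in the top k, which matches the "missed the right chunk" framing used in the cases above. The function names and sample data are hypothetical.

```python
def hit_rate_at_k(runs, k=5):
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for retrieved, relevant in runs if set(retrieved[:k]) & relevant)
    return hits / len(runs)


def mean_reciprocal_rank(runs):
    """Mean of 1 / rank of the first relevant chunk (0 if none was retrieved)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Each entry: (ranked chunk ids returned by the retriever, labelled relevant ids).
labelled_runs = [
    (["a", "b", "c"], {"b"}),  # first relevant chunk at rank 2
    (["d", "e"], {"z"}),       # miss
    (["f"], {"f"}),            # first relevant chunk at rank 1
]
print(hit_rate_at_k(labelled_runs, k=5))
print(mean_reciprocal_rank(labelled_runs))
```

Running these on every change to chunking, filters, or the reranker is what makes a regression visible before it reaches the generator.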

Why the default story persists

The benchmarks the default story rests on are typically built on Wikipedia, MS MARCO, or BEIR: corpora with general-domain queries answered by chunks rich in semantic content. Enterprise corpora are not like that. They are heavy on identifiers, on regulatory and technical vocabulary, and on entity disambiguation. Results on those benchmarks generalise poorly to them.

The other reason is commercial. Vector database vendors have a strong interest in the default story. Sparse retrieval has no vendor cheering for it. The pgvector-on-Postgres-with-BM25 architecture does not get conference keynotes.

If your enterprise RAG is underperforming, before you fine-tune the embedding model, try the boring fixes: hybrid retrieval, metadata filters, query rewriting, reranking. Most production retrieval wins are unglamorous.

If you would like a retrieval evaluation against your own corpus, we run them as a fixed-scope engagement.
