Reranking
Cross-encoder reranking is the single biggest precision lever in RAG, worth a 10-25% NDCG improvement. Main options: Cohere Rerank v3.5 (API), Jina v2 (137M params, self-hosted), BGE v2-m3 (568M params, highest-quality open weights).
A second-pass scoring step applied after initial retrieval. The retriever finds the top-k candidates fast; the reranker scores them accurately. This two-stage pattern consistently delivers 10-25% NDCG improvement over retrieval alone.
Why Reranking Exists
Vector search is approximate. Embedding models compress a chunk into 768 or 1536 numbers. Semantic meaning survives, but fine-grained relevance detail doesn't. A bi-encoder embeds query and document independently, which is fast but lossy.
A cross-encoder sees the query and document together, computing full attention across both. It can catch:
- Exact keyword matches the embedding missed
- Negations ("does NOT support X")
- Query-specific relevance that looks irrelevant in isolation
The cost: cross-encoders are ~100x slower than bi-encoders, so you never use them for first-pass retrieval over millions of chunks, only to rerank 20-100 candidates.
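The distinction is easy to demonstrate with sentence-transformers (a sketch; the model names are illustrative stand-ins, and any bi-encoder/cross-encoder pair behaves the same way):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "Does the API support batch deletion?"
doc = "Batch deletion is NOT supported; delete objects one at a time."

# Bi-encoder: embed query and document independently, compare afterwards.
# The negation is compressed into the embedding and may be lost.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, d_emb = bi_encoder.encode([query, doc])
print(bi_encoder.similarity(q_emb, d_emb))

# Cross-encoder: score query and document together with full attention,
# so tokens like "NOT" attend directly to the query terms.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, doc)]))
```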
Cohere Rerank
The default production choice. API-hosted, no GPU required.
```python
import os

import cohere

# Read the key from the environment rather than hard-coding it.
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

query = "What is the capital gains tax rate for 2025?"
documents = [chunk["text"] for chunk in top_100_chunks]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=documents,
    top_n=5,
    return_documents=True,
)

for result in response.results:
    print(f"Score: {result.relevance_score:.4f}")
    print(f"Text: {result.document.text[:200]}")
```

rerank-v3.5 (released 2024) supports multilingual reranking and semi-structured data (JSON, tables, code). Pass raw JSON objects as documents; it handles structure natively.
Pricing: ~$2 per 1,000 searches (each search = query + N documents scored).
Jina Reranker
Open-weights alternative. Run locally or via API.
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "jinaai/jina-reranker-v2-base-multilingual",
    torch_dtype=torch.float16,
    trust_remote_code=True,  # the scoring code ships with the model repo
)
model.to("cuda")  # fp16 weights expect a GPU; use float32 for CPU inference
model.eval()

# compute_score comes from the model's custom code, not vanilla transformers
pairs = [[query, doc] for doc in documents]
scores = model.compute_score(pairs, max_length=1024)
ranked = sorted(zip(scores, documents), reverse=True)
```

jina-reranker-v2-base-multilingual (137M params) runs on a single RTX 3080 at ~200 pairs/second. Suitable for on-prem deployments where the Cohere API is not an option.
BGE Reranker (BAAI)
Strong open-weights option, especially for Chinese + English:
```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
scores = reranker.compute_score([[query, doc] for doc in documents])
```

bge-reranker-v2-m3 delivers SOTA performance on BEIR at 568M params. Slower than Jina but higher quality.
LangChain Integration
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5", top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50}),
)

docs = compression_retriever.invoke("What is the capital gains tax rate?")
```

The base retriever fetches 50 chunks; Cohere reranks and returns the top 5.
LlamaIndex Integration
```python
from llama_index.postprocessor.cohere_rerank import CohereRerank

reranker = CohereRerank(api_key="...", top_n=5)
query_engine = index.as_query_engine(
    similarity_top_k=50,
    node_postprocessors=[reranker],
)
```

Retrieval Numbers
Typical improvement from adding a reranker on top of dense retrieval:
| Stage | NDCG@10 |
|---|---|
| BM25 only | ~0.45 |
| Dense only (text-embedding-3-large) | ~0.55 |
| Dense + Cohere rerank-v3.5 | ~0.65–0.70 |
| Hybrid (BM25+dense) + rerank | ~0.70–0.75 |
Source: BEIR benchmark; exact numbers vary by domain.
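For reference, NDCG@10 normalises the discounted gain of the top 10 results against the ideal ordering. A minimal sketch with made-up relevance labels:

```python
import math

def ndcg_at_k(relevances: list[int], k: int = 10) -> float:
    """relevances: graded relevance of results, in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical labels: reranking moves the relevant documents up.
print(ndcg_at_k([0, 1, 0, 2, 0, 0, 1, 0, 0, 0]))  # mediocre ordering
print(ndcg_at_k([2, 1, 1, 0, 0, 0, 0, 0, 0, 0]))  # same docs, better order
```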
When to Skip Reranking
- Latency budget is <200ms end-to-end (reranking adds 100-500ms)
- Corpus is small (<1,000 chunks) — the bi-encoder is accurate enough
- Queries are keyword-heavy and exact-match retrieval works fine
For everything else (especially financial, legal, and medical documents, and code search), add a reranker.
Reranker vs. Larger Retrieval k
A common alternative: just retrieve the top 20 instead of the top 5. This does not replicate reranking. The 20th bi-encoder result is often irrelevant, and reranking 50 candidates down to 5 produces a better top 5 than taking the bi-encoder's top 5 directly.
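A sketch of the two-stage pattern (toy corpus and illustrative model names; substitute your own retriever and reranker):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi = SentenceTransformer("all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the capital gains tax rate for 2025?"
corpus = ["chunk one ...", "chunk two ..."]  # your chunk texts

# Stage 1 (recall): the bi-encoder casts a wide net, cheap but lossy.
corpus_emb = bi.encode(corpus, convert_to_tensor=True)
query_emb = bi.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=50)[0]
candidates = [corpus[h["corpus_id"]] for h in hits]

# Stage 2 (precision): the cross-encoder re-scores; keep the best 5.
scores = ce.predict([(query, c) for c in candidates])
top5 = [c for _, c in sorted(zip(scores, candidates), key=lambda p: -p[0])[:5]]
```

Simply raising k in stage 1 skips stage 2 entirely, and stage 2 is where the precision gain comes from.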
Key Facts
- Cross-encoder vs bi-encoder: cross-encoder sees query+document together (full attention); bi-encoder embeds independently (~100x faster but lossy)
- Cohere rerank-v3.5: ~$2 per 1,000 searches; multilingual; handles JSON/tables/code natively
- Jina reranker-v2-base-multilingual: 137M params; ~200 pairs/second on RTX 3080
- BGE reranker-v2-m3: 568M params; SOTA on BEIR; best open quality
- NDCG@10 progression: BM25 ~0.45 → dense ~0.55 → dense+rerank ~0.65-0.70 → hybrid+rerank ~0.70-0.75
- Latency: reranking adds 100-500ms; skip if latency budget is <200ms
- More k ≠ reranking: top-20 bi-encoder results are not equivalent to top-5 after reranking 50 candidates
Common Failure Cases
Reranker adds 1-2 seconds of latency and breaks the SLA
Why: a remote reranker (Cohere API) adds one full API round-trip; at 50 candidates this can be 500-1500ms.
Detect: trace p95 latency; if reranking span exceeds 500ms consistently, it's a bottleneck.
Fix: reduce first-pass k from 50 to 20, or use a self-hosted Jina reranker for sub-100ms latency on short documents.
BGE reranker OOMs on long documents
Why: bge-reranker-v2-m3 (568M params) runs full cross-attention on [query + doc] pairs; 1K-token docs at batch size 16 exceed 24GB VRAM.
Detect: CUDA out of memory error in the reranker; happens when document chunks are longer than 512 tokens.
Fix: set max_length=512 and truncate documents; or reduce the batch size; or switch to Jina (137M, lighter).
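With FlagEmbedding, both knobs are arguments to compute_score (signature as of FlagEmbedding 1.2.x; check your installed version):

```python
scores = reranker.compute_score(
    [[query, doc] for doc in documents],
    batch_size=8,    # smaller batches cap peak activation memory
    max_length=512,  # truncate each query+document pair to 512 tokens
)
```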
Cohere rerank-v3.5 returns lower scores than expected for technical content
Why: the reranker's general training may not align well with very domain-specific terminology (proprietary product names, internal code names).
Detect: manual inspection shows clearly relevant documents scored <0.2 by the reranker.
Fix: use a self-hosted domain-fine-tuned cross-encoder; or supplement with keyword boosting for domain terms.
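A minimal fine-tuning sketch using sentence-transformers' CrossEncoder.fit (available through the 3.x releases; the training pairs and product name here are hypothetical):

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Label 1.0 = relevant, 0.0 = irrelevant; source these from real judgments.
train = [
    InputExample(texts=["What does QuantumLink replicate?", "QuantumLink replicates ..."], label=1.0),
    InputExample(texts=["What does QuantumLink replicate?", "Unrelated pricing FAQ ..."], label=0.0),
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(train_dataloader=DataLoader(train, shuffle=True, batch_size=16), epochs=2)
model.save("domain-reranker")
```

Even a few thousand labeled pairs can markedly improve scoring on domain terminology the general model misses.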
Reranking the wrong documents due to missing first-pass k tuning
Why: first-pass retrieval returns only k=5; the reranker can only re-order those 5; if the relevant document is at rank 6, reranking can't help.
Detect: RAGAS context recall is low despite high context precision after reranking.
Fix: increase first-pass k to 20-50 before reranking; recall must be established in the first pass.
JSON/table documents not scored correctly by text-only rerankers
Why: older rerankers serialise JSON to plain text, destroying structure signals; a Jina v1 model may score a table lower than a prose duplicate.
Detect: compare scores for identical content in JSON vs prose format; >20% difference indicates structure sensitivity.
Fix: use Cohere rerank-v3.5 which handles semi-structured data natively; or serialise structured content to a consistent readable format before reranking.
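A sketch of the fallback: flatten JSON into consistent key-value prose before scoring (flatten_record is a hypothetical helper, not part of any reranker API):

```python
import json

def flatten_record(obj: dict, prefix: str = "") -> list[str]:
    """Flatten nested JSON into 'key: value' lines a text reranker can read."""
    lines = []
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            lines.extend(flatten_record(value, prefix=f"{path}."))
        else:
            lines.append(f"{path}: {value}")
    return lines

record = json.loads('{"rate": {"long_term": "15%", "short_term": "37%"}, "year": 2025}')
text = "\n".join(flatten_record(record))
# -> rate.long_term: 15%
#    rate.short_term: 37%
#    year: 2025
# Rerank `text` rather than the raw JSON string.
```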
Connections
- rag/pipeline — where reranking fits in the full pipeline
- rag/hybrid-retrieval — hybrid retrieval provides the best first-pass candidates to rerank
- rag/embeddings — the bi-encoder stage that precedes reranking
Open Questions
- Does Cohere rerank-v3.5's native JSON/table handling actually improve reranking quality vs text-only for structured document corpora?
- Is 50 the right first-pass k value, or does the optimal first-pass size vary by domain?
- Can self-hosted rerankers (Jina, BGE) match Cohere's multilingual quality for European languages?