We are building a system that answers questions about EU financial and AI regulation and backs every statement with the exact article of the actual law. The part that decides whether that works is retrieval: if the right provision never surfaces, nothing downstream can save the answer. While choosing our retrieval stack we ran a controlled test on a real corpus, and one result was clear enough that it is worth sharing, with all the caveats attached.
The short version: adding a reranker was the single largest accuracy gain in the pipeline, and it cost almost nothing to add.
What a reranker does
A typical semantic search pipeline has two stages. First, an embedding model casts a wide net and pulls back a shortlist of candidate provisions by meaning; this step is fast but approximate. Second, a reranker re-reads each shortlisted provision directly against the question and reorders the shortlist. The embedder is good at finding the right article somewhere in the top fifty; the reranker is good at moving it to the top. For a citation-grounded answer, the top is what matters, because the answer model should be reading the most relevant provision first.
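To make the two stages concrete, here is a minimal sketch in Python. It is not our production stack: the bag-of-words cosine stands in for an embedding model, the `overlap_score` function stands in for a real reranker, and the provision identifiers and texts are made-up toy data, not quotations from any regulation. All function names are hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenisation; a deliberately crude stand-in."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_stage(question, corpus, pool_size=50):
    """Stage 1: cheap, approximate shortlist (stand-in for embedding search)."""
    q = Counter(tokens(question))
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine(q, Counter(tokens(corpus[doc_id]))),
        reverse=True,
    )
    return ranked[:pool_size]


def overlap_score(question, provision):
    """Toy pairwise scorer: share of question terms present in the provision."""
    q = set(tokens(question))
    return len(q & set(tokens(provision))) / len(q) if q else 0.0


def rerank(question, candidates, corpus, score_pair=overlap_score):
    """Stage 2: score each shortlisted provision against the question, reorder."""
    return sorted(
        candidates,
        key=lambda doc_id: score_pair(question, corpus[doc_id]),
        reverse=True,
    )


# Invented toy provisions, for illustration only.
corpus = {
    "provision-a": "crypto asset service providers must obtain authorisation",
    "provision-b": "providers of high risk ai systems must register",
    "provision-c": "transfers of funds must carry payer information",
}
question = "must crypto asset providers obtain authorisation"

pool = first_stage(question, corpus, pool_size=2)
ranked = rerank(question, pool, corpus)
print(ranked)
```

The point of the shape is that the reranker only ever sees the shortlist, so it can afford a more expensive question-versus-provision comparison than the first stage can. In a real setup, `score_pair` would call a reranking model rather than count shared words.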
How we measured
The corpus is the consolidated text of seven EU instruments (MiCA, DORA, the Transfer of Funds Regulation, the AML Regulation, AML Directive 6, the AI Act, and the AMLA Regulation), split into roughly 56,000 article- and paragraph-level units. We wrote 151 questions across four difficulty bands, from clean keyword-style queries to messy, colloquial ones, each labelled with the article that should be retrieved. We retrieved a 50-candidate pool with a single embedder, then reordered it with each reranker, and scored with Mean Reciprocal Rank (MRR): for each question we take the reciprocal of the rank at which the correct article appears, then average over all questions. The higher the correct article ranks, the higher the score, and 1.0 means it was first every time.
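For readers who have not computed MRR before, a minimal sketch follows. It assumes one labelled article per question and scores a question as 0 when the correct article is missing from the ranked list; both are assumptions made for the illustration, and the identifiers are hypothetical.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """Return 1/rank of the gold article, or 0.0 if it is not in the list."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, gold_id) pairs."""
    if not runs:
        raise ValueError("need at least one question to score")
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


# Three toy questions: gold article ranked first, ranked second, and missing.
runs = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "b"),
    (["a", "b", "c"], "z"),
]
mrr = mean_reciprocal_rank(runs)  # (1 + 1/2 + 0) / 3
print(mrr)
```

Because the score falls off as 1, 1/2, 1/3 and so on, MRR rewards moving the correct article from third place to first much more than moving it from thirtieth to tenth, which matches what matters for a citation-grounded answer.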
The result
Mean MRR, by question band, with no reranker versus three rerankers, on our corpus:

On the hard, colloquially phrased questions, the kind real users actually ask, reranking raised MRR from 0.55 to 0.70 and the share of questions with a correct article in the top five from 76% to 87%. That is a large gain from a component that adds milliseconds and a fraction of a cent per query.
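The top-five share quoted above is a hit rate at k = 5, reported per question band. A minimal sketch of that calculation, assuming each question carries a band label and a set of acceptable articles; the function name, band names other than "hard", and identifiers are hypothetical.

```python
from collections import defaultdict


def hit_rate_at_k(runs, k=5):
    """Share of questions, per band, with a correct article in the top k.

    runs: iterable of (band, ranked_ids, gold_ids) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for band, ranked_ids, gold_ids in runs:
        totals[band] += 1
        if any(doc_id in gold_ids for doc_id in ranked_ids[:k]):
            hits[band] += 1
    return {band: hits[band] / totals[band] for band in totals}


# Toy data: one hard question misses the top five, one hits, one easy hit.
runs = [
    ("hard", ["a", "b", "c", "d", "e", "f"], {"f"}),
    ("hard", ["a", "b", "c", "d", "e", "f"], {"b"}),
    ("easy", ["a", "b"], {"a"}),
]
rates = hit_rate_at_k(runs, k=5)
print(rates)
```

Hit rate and MRR answer different questions: hit rate says whether the right provision reaches the answer model at all, while MRR says how early it gets there.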
On vendors, and stated only for our corpus and our questions: Voyage rerank-2.5 led every band, which is why we adopted it. Cohere rerank-v3.5 was close to flat against the no-reranker baseline, even with full-length documents and its token budget set fairly. We did not see it help on this material, which surprised us, and we would not generalise that beyond this corpus.
A note on legal-specialist models
We included Kanon 2 Reranker from Isaacus, a reranker built specifically for legal text, because it is exactly the kind of tool you would expect to win here, and because its published benchmark reports an advantage over Voyage. On our corpus it came second or third in every band and helped only on the hard set. This is not a refutation of anyone's benchmark: Isaacus reports results on a different dataset of mixed legal material, and benchmarks measure what they measure. What we can say is narrower and, we think, more useful: on consolidated EU financial and AI regulation, with our questions and a general-purpose first-stage retriever, a strong general reranker outperformed the domain-specialist one. We saw the same pattern earlier with embedding models. For clean, well-structured statutory text, domain specialisation is worth testing rather than assuming.
Limitations
These numbers are ours, not a public benchmark. The 151 questions were written and labelled by us, not adjudicated by independent counsel; the test used one candidate-pool size and one first-stage retriever; and the field of rerankers is larger than three. Treat the vendor ordering as what worked in our setup, and the rerank lift as the finding most likely to generalise.
Takeaway
If you are building retrieval over statutory or regulatory text and you have not added a reranker, that is probably the highest-return change available to you, especially for the natural-language questions real users type. Measure it on your own corpus, with your own questions, before you trust anyone's leaderboard, including ours.
