Embedding Model Selection for Citation-Grounded Retrieval over European Union Financial and AI Regulation

Abstract

We report a controlled empirical comparison of five text-embedding models for first-stage semantic retrieval over a corpus of consolidated European Union legislation governing crypto-assets, anti-money-laundering, digital operational resilience, and artificial intelligence. The corpus comprises seven legislative instruments segmented into 56,413 citable provision-level units across eight official languages. Five models were evaluated under identical conditions: two general-purpose proprietary models (OpenAI text-embedding-3-large; Google gemini-embedding-001), two legal-domain-specialised models (Isaacus Kanon 2; Voyage voyage-law-2), and one general-purpose model from a legal-adjacent vendor (Voyage voyage-3-large). Each model embedded the full corpus once and was queried over an approximate-nearest-neighbour index using cosine similarity. We constructed four evaluation rounds of increasing difficulty, totalling 159 questions with human-authored gold provision labels, and measured hit@5, hit@10, recall@10, and mean reciprocal rank (MRR).

The fourth round additionally probed cross-statute disambiguation, near-duplicate-article discrimination, multi-provision compound queries, and out-of-scope (unanswerable) questions, the last scored by the similarity margin between answerable and unanswerable queries as a proxy for hallucination resistance.

Two findings are notable. First, domain-specialised legal embedders did not outperform general-purpose models on this statutory corpus; the strongest single model overall was a general-purpose model (voyage-3-large), and a general-purpose multilingual model (gemini-embedding-001) was strongest on colloquially phrased queries. Second, the two strongest retrievers exhibited the smallest separation between answerable and unanswerable queries, indicating that retrieval quality and intrinsic abstention capability are not aligned and must be handled by distinct system components. We discuss implications for embedding selection in high-precision legal retrieval.

Introduction

Retrieval-augmented question answering over primary legal sources places unusually strict demands on the retrieval stage. In a citation-grounded system, every assertion shown to a user must be traceable to a specific provision of a specific instrument, and the cost of surfacing the wrong provision, or a provision from the wrong instrument, is materially higher than in general web or enterprise search. The first-stage retriever, typically a dense embedding model paired with an approximate-nearest-neighbour (ANN) index, therefore determines the ceiling on end-to-end precision: a provision that is not recalled into the candidate set cannot be cited, reranked, or verified downstream.

This report studies which embedding model best serves that first stage for a corpus of European Union financial-services and artificial-intelligence regulation. The question is practically consequential because the available models differ along several axes that plausibly matter for legal text: general-purpose versus legal-domain specialisation, embedding dimensionality (and therefore storage and search cost), and whether the model distinguishes query encodings from document encodings (asymmetric retrieval). We evaluate five models spanning these axes under a single, fixed protocol, and we deliberately escalate task difficulty across four rounds so that differences invisible on easy, keyword-aligned questions become measurable on realistic and adversarial ones.

Corpus

The corpus consists of the consolidated texts of seven European Union legislative instruments, retrieved from the official EU publications repository and parsed from the canonical machine-readable legal format into a hierarchical structure of provision-level units (articles, paragraphs, points, and recitals). Each instrument is held in eight official-language editions (English, German, French, Italian, Spanish, Dutch, Polish, and Lithuanian). After parsing and normalisation, the corpus contains 56,413 citable units; retrieval and scoring in this study were conducted against the English edition except where a round explicitly states otherwise.

Methods

Models under test

Five embedding models were evaluated. Two are general-purpose proprietary models; two are explicitly legal-domain-specialised; one (voyage-3-large) is a general-purpose model from a vendor that also publishes a legal model, included to separate the effect of vendor from the effect of domain specialisation. Embeddings were stored at each model's native or default output dimensionality. Models offering asymmetric encoding used the document transformation for corpus text and the query transformation for questions; the symmetric model used a single transformation for both.

Embedding and indexing protocol

Each model embedded all 56,413 citable units exactly once. Provision text was truncated to a fixed character budget well within every model's context limit, ensuring identical input boundaries across models. Vectors were written to a relational store and indexed with HNSW under cosine distance, one index per model. Embedding was idempotent and resumable: each unit was embedded once per model and re-runs were no-ops, eliminating duplication. The procedure embedded 56,413 vectors per model with full coverage and no missing units.
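The idempotent, resumable embedding step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the in-memory dictionary stands in for the relational store, `fake_embed` stands in for a vendor API call, and `CHAR_BUDGET`, the unit identifiers, and the model name are hypothetical (the report does not state the character budget).

```python
from typing import Callable, Dict, Iterable, List, Tuple

CHAR_BUDGET = 2000  # hypothetical value; the actual budget is not stated in the report

Vector = List[float]
Store = Dict[Tuple[str, str], Vector]  # (model name, unit id) -> vector


def embed_corpus(
    model: str,
    units: Iterable[Tuple[str, str]],
    embed_fn: Callable[[str], Vector],
    store: Store,
) -> int:
    """Embed every (unit_id, text) pair not yet stored; return the count of new vectors."""
    written = 0
    for unit_id, text in units:
        key = (model, unit_id)
        if key in store:
            continue  # already embedded for this model, so a re-run does nothing
        # Identical truncation for every model keeps input boundaries comparable.
        store[key] = embed_fn(text[:CHAR_BUDGET])
        written += 1
    return written


def fake_embed(text: str) -> Vector:
    """Stand-in for a real embedding call; it only records the input length."""
    return [float(len(text))]


units = [("instrument-1:art-5", "x" * 5000), ("instrument-2:art-12", "short text")]
store: Store = {}
first_run = embed_corpus("model-a", units, fake_embed, store)
second_run = embed_corpus("model-a", units, fake_embed, store)
```

Keying the store on the (model, unit) pair is what makes a restart safe: an interrupted run resumes where it stopped, and a completed run writes nothing further.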

Retrieval and scoring

For each question, the query was embedded with the model's query transformation and the top k = 10 nearest provisions were retrieved by cosine similarity. A retrieved unit was counted as correct when both its instrument and its article number matched an expected (instrument, article) pair in the gold label. We report four metrics: hit@5 and hit@10 (the fraction of questions with at least one correct provision in the top 5 and top 10 respectively), recall@10 (the fraction of all expected provisions retrieved within the top 10), and mean reciprocal rank (MRR, the mean of the reciprocal of the rank of the first correct provision, which rewards ranking the correct provision higher). Scoring was fully automatic against fixed gold labels; no human judgement entered the per-query scoring.
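The four metrics can be computed per query and then averaged, as in the sketch below. The function names and the toy data are illustrative. One assumption is labelled in the code: a question with no correct provision in the top 10 contributes a reciprocal rank of zero.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (instrument, article number)


def score_query(retrieved: List[Pair], gold: Set[Pair]) -> Dict[str, float]:
    """Score one ranked result list (best first) against the gold pairs."""
    top10 = retrieved[:10]
    first_hit = next(
        (rank for rank, pair in enumerate(top10, start=1) if pair in gold), None
    )
    return {
        "hit@5": 1.0 if first_hit is not None and first_hit <= 5 else 0.0,
        "hit@10": 1.0 if first_hit is not None else 0.0,
        # Several units of one article collapse to a single (instrument, article) pair.
        "recall@10": len(gold & set(top10)) / len(gold),
        # Assumption: a miss within the top 10 contributes 0 to MRR.
        "rr": 1.0 / first_hit if first_hit is not None else 0.0,
    }


def aggregate(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-query scores; the mean of 'rr' is reported as MRR."""
    n = len(per_query)
    means = {name: sum(q[name] for q in per_query) / n for name in per_query[0]}
    means["mrr"] = means.pop("rr")
    return means


# Toy example: two questions with invented instrument labels.
q1 = score_query(
    retrieved=[("A", "3"), ("A", "5"), ("A", "5"), ("C", "1")],
    gold={("A", "5"), ("B", "12")},
)
q2 = score_query(retrieved=[("B", "1")], gold={("A", "1")})
summary = aggregate([q1, q2])
```

In the toy example the first question finds one of its two gold articles at rank 2 (reciprocal rank 0.5, recall@10 0.5) and the second question misses entirely, so the aggregate hit@10 is 0.5 and MRR is 0.25.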

Gold question suite

We authored four rounds of questions with human-assigned gold provision labels, drawn from the instruments' article headings and verified against the corpus. Round 1 (R1, 25 questions) covers core obligations with mostly keyword-aligned phrasing. Round 2 (R2, 57 questions) extends coverage across all seven instruments at article level. Round 3 (R3, 45 questions) rephrases realistic questions in colloquial, scenario-based language that deliberately avoids the vocabulary of the article headings, to test robustness to natural user phrasing. Round 4 (R4, 32 questions) is a stress round with four categories: cross-statute disambiguation (a topic that recurs across many instruments, pinned to one), near-duplicate discrimination (a near-twin article exists; the question targets the less obvious one), compound questions (requiring two or more provisions or instruments simultaneously), and out-of-scope negatives (questions with no answer anywhere in the corpus). Negative questions are scored not by hit but by the top-1 cosine similarity they elicit; a model that resists returning a confidently similar but irrelevant provision is preferable.

Results

Rounds 1–3: coverage and phrasing robustness

Tables 3–5 report the five models on R1, R2, and R3. On the keyword-aligned and multi-instrument rounds (R1, R2) the general-purpose model voyage-3-large is strongest on every metric, with gemini-embedding-001 and text-embedding-3-large close behind. On the colloquial round (R3) all models degrade substantially, but the degradation is uneven: gemini-embedding-001 is the most robust to natural phrasing (hit@10 0.933, recall@10 0.852, MRR 0.658), while text-embedding-3-large, a close runner-up on the easier rounds, falls furthest. The legal-specialised models (Kanon 2, voyage-law-2) do not lead any of these rounds.

Round 4: cross-statute, near-duplicate, and compound retrieval

Tables 6–8 report the three answerable categories of R4. On cross-statute disambiguation and near-duplicate discrimination, voyage-3-large attains the highest MRR by a clear margin (0.788 and 0.917 respectively), indicating that it not only retrieves the correct provision but also ranks it first most often, even when a near-identical provision from another instrument competes. gemini-embedding-001 matches it on hit rate and recall but ranks the correct provision slightly lower. On compound questions, which require multiple provisions, voyage-3-large achieves the highest recall@10 (0.617); voyage-law-2 is markedly the weakest (recall@10 0.250), failing to assemble multi-provision answers.

Round 4: unanswerable queries and abstention margin

Table 9 reports the mean top-1 similarity on out-of-scope negative questions (n = 8), the mean top-1 similarity on answerable questions, and the margin between the two. The margin is a calibration-free indicator of whether a model can distinguish answerable from unanswerable queries by similarity alone: a larger margin means the score of the top result is more informative about whether any relevant provision exists. The ordering by margin is roughly the inverse of the ordering by retrieval quality. text-embedding-3-large, a mid-ranked retriever, has the largest margin (0.282) and assigns low similarity (0.331) to unanswerable questions. gemini-embedding-001, among the strongest retrievers, has the smallest margin (0.115) and assigns high similarity (0.658) even to questions with no answer in the corpus; voyage-3-large has the second-smallest margin (0.137). Absolute cosine values are not comparable across models, but the within-model margin is, and on that measure the two best retrievers separate answerable from unanswerable queries least well.
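The margin can be expressed in a few lines. The sketch assumes the margin is the mean top-1 cosine similarity on answerable questions minus the mean top-1 similarity on unanswerable ones, which is how the description of Table 9 reads; the similarity values below are invented for illustration and are not taken from the study.

```python
from statistics import mean
from typing import Sequence


def abstention_margin(
    answerable_top1: Sequence[float], unanswerable_top1: Sequence[float]
) -> float:
    """Mean top-1 similarity on answerable queries minus that on unanswerable ones.

    Assumed definition: the report describes the margin only as the gap
    between these two quantities.
    """
    return mean(answerable_top1) - mean(unanswerable_top1)


# Invented top-1 similarities for a single hypothetical model.
answerable = [0.62, 0.58, 0.66]
unanswerable = [0.30, 0.36]
margin = abstention_margin(answerable, unanswerable)
print(f"margin = {margin:.2f}")
```

Under this definition, the reported figures would imply a mean answerable top-1 similarity of roughly 0.613 for text-embedding-3-large (0.331 + 0.282) and roughly 0.773 for gemini-embedding-001 (0.658 + 0.115). Because the margin is computed within one model, it sidesteps the fact that raw cosine values are not comparable across models.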

Cross-round synthesis

Table 10 summarises MRR across the four rounds (the R4 figure is the mean over its three answerable categories). voyage-3-large leads three of four rounds; gemini-embedding-001 leads the colloquial round and is otherwise second. The two legal-specialised models trail throughout, with voyage-law-2 lowest overall. The ordering is stable enough across rounds, and the margins on the harder rounds large enough, to support a clear ranking of the field on this corpus.

Discussion

Domain specialisation did not pay off. Both legal-specialised models underperformed general-purpose models on every round, and the legal model in the strongest vendor's line-up (voyage-law-2) was the weakest of the five overall and conspicuously poor on compound questions. The most plausible explanation is a domain-shift mismatch: legal embedders are predominantly trained on case law and contractual prose, whereas consolidated EU regulation is highly structured statutory text with explicit, consistent terminology. General-purpose models, trained on far broader corpora, appear to cover this register at least as well, and their stronger handling of paraphrase helps most where it matters, on naturally phrased questions.

Phrasing is the dominant difficulty axis. Every model lost substantial accuracy moving from keyword-aligned (R1) to colloquial (R3) questions, confirming that benchmark performance on heading-aligned queries overstates real-world performance. The model that degraded least under paraphrase (gemini-embedding-001) is not the model that scored highest on clean queries (voyage-3-large), so the choice depends on the expected distribution of user phrasing.

Retrieval quality and abstention capability are not aligned. The clearest novel result is the inverse relationship in Table 9: the two strongest retrievers are the least able to signal, by similarity alone, that a question has no answer in the corpus. For a citation-grounded system this is consequential, because confidently returning an irrelevant provision for an out-of-scope question is a hallucination risk. The practical implication is that abstention must not be delegated to the first-stage embedding score; it belongs to a downstream relevance or verification component. Embedding selection should therefore optimise recall of the correct provision into the candidate set, and treat intrinsic separability as a secondary, non-decisive signal.

Cost and efficiency favour lower-dimensional vectors. The strongest overall model stores 1024-dimensional vectors, one third the size of the 3072-dimensional general-purpose alternative. This cuts vector storage to one third and reduces the per-comparison cost of nearest-neighbour search roughly in proportion, without a retrieval-quality penalty on this corpus. Where storage or latency is the binding constraint, this is a decisive secondary advantage.
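The storage difference is simple arithmetic. The sketch below assumes 4-byte float32 components and counts raw vector payload only, ignoring HNSW graph links and row overhead in the relational store; neither assumption is stated in the report.

```python
N_UNITS = 56_413          # citable units embedded per model (from the corpus section)
BYTES_PER_COMPONENT = 4   # assumption: float32 storage


def raw_vector_bytes(n_vectors: int, dim: int, width: int = BYTES_PER_COMPONENT) -> int:
    """Raw payload size of n_vectors dense vectors of the given dimensionality."""
    return n_vectors * dim * width


small = raw_vector_bytes(N_UNITS, 1024)
large = raw_vector_bytes(N_UNITS, 3072)

for dim, size in ((1024, small), (3072, large)):
    print(f"{dim:>4} dims: {size / 2**20:.1f} MiB")
print(f"ratio: {large // small}x")
```

On these assumptions the 1024-dimensional index holds about 220 MiB of vector data against about 661 MiB at 3072 dimensions. The absolute figures are small at this corpus size, so the saving matters mainly if all eight language editions or further instruments are indexed.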

Limitations

Sample size. The four rounds total 159 questions; per-category counts in Round 4 are small (5–11), so category-level differences should be read as indicative rather than definitive.
Single-language scoring. Retrieval was scored on the English edition. Cross-lingual retrieval, although supported by the corpus, is not reported here and may reorder the field, particularly for models with weaker multilingual coverage.
First-stage only. We measure first-stage dense retrieval in isolation. A downstream reranking and verification stage, present in the production system, would alter end-to-end behaviour and is expected to compress some of the differences observed here.
Point-in-time models. Vendor models evolve; results pertain to the model versions available at the time of evaluation and to the consolidated texts as retrieved.

Conclusion

Across four rounds of increasing difficulty over a 56,413-unit corpus of EU financial and AI regulation, general-purpose embedding models matched or exceeded legal-domain-specialised models for provision-level retrieval. voyage-3-large was the most consistent and the strongest overall, leading three of four rounds and offering a threefold reduction in vector size relative to the 3072-dimensional alternative; gemini-embedding-001 was the most robust to colloquial phrasing. Crucially, the strongest retrievers were the weakest at distinguishing answerable from unanswerable queries by similarity alone, so a high-precision legal system should select its embedding model to maximise recall of the correct provision and place the abstention decision in a separate downstream component. For practitioners building retrieval over civil-law statutory corpora, the broader lesson is that domain specialisation is an empirical question rather than a default, and that intrinsic abstention behaviour deserves explicit measurement alongside conventional retrieval metrics.
