From public URL to official record: source-route inference in Horizon

Horizon is a regulatory intelligence service for fintech and crypto compliance. We watch official sources across EU regulators, six themes, and country-level national competent authorities (NCAs). We turn substantive changes into structured, plain-English alerts grounded in source quotes and extracted obligations, and deliver them via email, Slack, Telegram, webhook, or REST API. This article explains how the retrieval side works.

Official publications no longer live in one place. A government notice may appear as a web page, an API record, a feed entry, a PDF, a register row, a legal identifier, or a sitemap URL. A pasted URL points to one surface. It rarely proves that the surface is the best route to the official record.

Horizon treats the pasted URL as an observation, not as the source itself. The system asks a stricter question: which official route gives the cleanest record, the least page noise, the safest fetch path, and the strongest proof of change?

We call that task source-route inference, and it is the first contribution described here. A URL enters the system. The system tests the available routes. It then binds the retrieved object to a change record. The object may be a notice, filing, consultation, register entry, warning, law text, speech, data release, or document section.

The route hierarchy

Official publishers already expose machine routes. GOV.UK's search endpoint is public and rejects unknown or invalid parameters with HTTP 422. The SEC's EDGAR APIs serve JSON without API keys and update filing data during the day. The EU Publications Office's Cellar exposes REST, SPARQL, RSS, and Atom around structured identifiers for EU publications. The FCA exposes a Register API and notes that the free service carries no uptime or issue-resolution promise. These facts show why an official-source system must assess route quality before it reads visible page text.

The route order is not a scraper trick. It is a way to respect the publisher's own structure. Native APIs come first when they exist and fit the task. Feeds come next because they expose item-level records. Sitemaps help find new or changed URLs. Registers and legal stores carry stable identifiers. Static HTML and rendered HTML remain fallback routes. PDFs need document-level handling, not a short text prefix.
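
The ordering above can be sketched as a ranked enumeration. The names `RouteKind` and `pick_route` are illustrative assumptions, not Horizon's identifiers. The relative position of sitemaps and registers is also an assumption, and the sketch presumes that earlier checks have already decided which candidate routes are usable.

```python
from enum import IntEnum
from typing import Iterable, Optional


class RouteKind(IntEnum):
    """Lower value means a more preferred route (ordering is an assumption)."""
    NATIVE_API = 1
    FEED = 2
    SITEMAP = 3
    REGISTER_OR_LEGAL_STORE = 4
    STATIC_HTML = 5
    RENDERED_HTML = 6


def pick_route(candidates: Iterable[RouteKind]) -> Optional[RouteKind]:
    """Return the most preferred usable route, or None when there is none."""
    return min(candidates, default=None)
```

PDFs are left out of the enumeration because the text treats them as documents that need their own handling, not as a route.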

Feeds are not all equal. RSS treats guid as optional and leaves its syntax to the feed source. Atom requires an atom:id, treats it as a permanent identifier, and compares it character by character. RFC 5005 also defines paged feeds, where one feed URL may not contain the whole logical feed. Horizon therefore treats feed identity as source-specific state, not as a universal hash rule.
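
A minimal sketch of source-specific feed identity follows. It uses the standard library's `xml.etree.ElementTree`. The function names and the RSS fallback (a hash of link and title) are assumptions made for illustration; a real system would choose the fallback per source.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def atom_entry_key(entry: ET.Element) -> str:
    """Atom requires atom:id, so use it verbatim as the identity key."""
    entry_id = entry.findtext(f"{ATOM_NS}id")
    if entry_id is None:
        raise ValueError("Atom entry without atom:id")
    return entry_id  # no case folding or normalization before comparison


def rss_item_key(item: ET.Element) -> str:
    """RSS guid is optional, so fall back to a hash (fallback is an assumption)."""
    guid = item.findtext("guid")
    if guid:
        return "guid:" + guid
    basis = (item.findtext("link") or "") + "\n" + (item.findtext("title") or "")
    return "sha256:" + hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

The prefixes keep a publisher-supplied guid from colliding with a locally derived hash.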

Sitemaps solve a different problem. They list URLs and optional update dates. A sitemap index can show when a sitemap file changed, not when every page inside changed. The sitemap protocol also caps a sitemap at 50,000 URLs and 50 MB uncompressed. Horizon reads sitemap dates as retrieval hints. It still verifies page or document content before it records a change.
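
Reading sitemap dates as hints can be sketched like this. The function names are assumptions, and the parser is the standard library's `xml.etree.ElementTree`; code that parses untrusted XML in production would likely want a hardened parser instead.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000  # protocol cap per sitemap file


def sitemap_hints(xml_text: str) -> Dict[str, Optional[str]]:
    """Map each <loc> to its optional <lastmod>. Values are hints, not proof."""
    root = ET.fromstring(xml_text)
    hints: Dict[str, Optional[str]] = {}
    for url in root.iter(f"{SM_NS}url"):
        loc = url.findtext(f"{SM_NS}loc")
        if loc:
            hints[loc.strip()] = url.findtext(f"{SM_NS}lastmod")
    if len(hints) > MAX_URLS:
        raise ValueError("sitemap exceeds the protocol's URL cap")
    return hints


def needs_fetch(url: str, hints: Dict[str, Optional[str]],
                seen: Dict[str, Optional[str]]) -> bool:
    """Queue a verification fetch when the URL is new or its lastmod differs."""
    return url not in seen or hints.get(url) != seen[url]
```

A `True` result only schedules a fetch. The change record still depends on the content check that follows.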

HTML feed discovery has a precise signal. The HTML standard says that a link element with rel="alternate" and type="application/rss+xml" or type="application/atom+xml" identifies a feed for discovery. Ordinary links in the page body do not carry the same signal. Horizon uses that distinction to separate a publisher's machine path from ordinary page links.
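
The discovery rule can be sketched with the standard library's `html.parser`. The class and function names are assumptions. The sketch collects only `link` elements and ignores anchors in the body, even when an anchor carries a feed-like type attribute.

```python
from html.parser import HTMLParser
from typing import List

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> elements with a feed type."""

    def __init__(self) -> None:
        super().__init__()
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return  # <a> elements and other tags are ignored
        a = {k: (v or "") for k, v in attrs}
        rels = a.get("rel", "").lower().split()
        mime = a.get("type", "").strip().lower()
        if "alternate" in rels and mime in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])


def discover_feeds(html: str) -> List[str]:
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds
```

The returned values may be relative. They still need URL resolution and the same safety checks as any other outbound link.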

HTTP also carries change signals. RFC 9110 defines entity tags and modification dates for conditional requests. If-None-Match lets a client ask whether a stored entity tag still matches. If-Modified-Since lets a client avoid transfer when the selected representation has not changed. Horizon keeps these signals beside its own content hashes because server validators and local hashes answer different questions.
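
A minimal sketch of keeping server validators beside local hashes follows. The header names come from RFC 9110. The `Validators` type, the function names, and the state labels are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Validators:
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(v: Validators) -> Dict[str, str]:
    """Build conditional request headers from stored validators."""
    headers: Dict[str, str] = {}
    if v.etag:
        headers["If-None-Match"] = v.etag
    if v.last_modified:
        headers["If-Modified-Since"] = v.last_modified
    return headers


def classify(status: int, body_hash: Optional[str], stored_hash: Optional[str]) -> str:
    """Keep the server's answer and the local hash comparison as separate states."""
    if status == 304:
        return "server_says_unchanged"
    if status == 200 and body_hash == stored_hash:
        return "bytes_unchanged"  # full response sent, but bytes are identical
    if status == 200:
        return "bytes_changed"
    return "transport_other"
```

The two middle states show why both signals are kept: a server can send a full 200 response for bytes that match the stored hash.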

The accepted-source ledger

A page fetch can succeed and still fail as retrieval. The server may return a cookie shell, a search shell, a localized view, a blocked-bot page, a stale cache, or a document list without the document body. A system that records only "HTTP 200" cannot tell these cases apart. Horizon separates transport state, extraction state, content state, and decision state.

That separation creates the second contribution: the accepted-source ledger. A retrieved body does not become the next baseline merely because it arrived. It must pass route checks, content checks, extraction checks, and decision checks. If extraction collapses or a model call fails, the system records the run but does not let the broken body replace the prior accepted record.
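
The baseline rule can be sketched as a small ledger. `RunRecord` and `Ledger` are hypothetical names, and the four boolean fields stand in for the richer state that the text describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunRecord:
    body_hash: str
    transport_ok: bool
    extraction_ok: bool
    content_ok: bool
    decision_ok: bool

    @property
    def accepted(self) -> bool:
        return all((self.transport_ok, self.extraction_ok,
                    self.content_ok, self.decision_ok))


class Ledger:
    def __init__(self) -> None:
        self.runs: List[RunRecord] = []
        self.baseline: Optional[RunRecord] = None

    def record(self, run: RunRecord) -> None:
        self.runs.append(run)  # every run is kept, including failed ones
        if run.accepted:
            self.baseline = run  # only a fully checked run becomes the baseline
```

A run that fails extraction still appears in `runs`, so the failure is visible later, but the next comparison uses the prior accepted record.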

The accepted-source ledger changes the meaning of silence. No alert can mean that the official route did not change. It can also mean that robots rules blocked the path, the fetch failed, extraction looked unsafe, the PDF parser failed, the model refused the decision, or the change fell outside the user's selected scope. Horizon makes these states distinct.

Robots.txt belongs in the ledger, but it is not proof of authority. RFC 9309 says crawler rules are requests that crawlers honor, not access authorization. Horizon records and respects crawler policy. It still uses separate checks for source trust, user safety, and retrieval quality.

Retrieval safety as a first-class concern

URL safety is part of retrieval science here, not a later hardening step. OWASP names mishandled URLs and custom webhooks as server-side request forgery triggers. Python's URL tools also warn that parsing does not validate input and that urljoin can replace a trusted host when the second value is absolute. Horizon treats monitor URLs, document links, feed links, sitemap links, rendered-page navigation, and callback URLs as outbound execution surfaces.
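
The `urljoin` behaviour is easy to reproduce with the standard library. The guard below is a sketch: `join_same_host` is a hypothetical helper, and comparing hostnames only is an assumption. A publisher that hosts documents on a second domain would need an allowlist instead.

```python
from urllib.parse import urljoin, urlsplit


def join_same_host(base: str, link: str) -> str:
    """Resolve link against base and refuse results that leave the base host."""
    joined = urljoin(base, link)
    if urlsplit(joined).hostname != urlsplit(base).hostname:
        raise ValueError("link leaves the trusted host")
    return joined
```

Both an absolute link and a scheme-relative link such as `//evil.example/x` replace the host of the base URL, so the check runs on the joined result, not on the raw link.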

A safe retrieval path checks the scheme, host, port, DNS result, IP range, redirect chain, content type, and byte size before it trusts the response. The Python Requests library follows redirects by default, so a system that validates only the first URL can still fetch a forbidden target after a redirect. The Requests documentation also states that its timeout is not a limit on the total download time. Horizon therefore treats timeout, byte count, redirect count, and decoded body size as separate controls.
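
Some of these checks can be sketched without any network code. The function names, the HTTPS-only policy, and the port allowlist are assumptions. The sketch does not resolve DNS or follow redirects itself. A caller would run the checks on every hop and connect to the address it checked.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}   # assumed policy
ALLOWED_PORTS = {None, 443}   # None means the scheme's default port


def check_url(url: str) -> str:
    """Validate scheme, host, and port, then return the hostname."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("scheme not allowed")
    if not parts.hostname or parts.username or parts.password:
        raise ValueError("missing host or embedded credentials")
    if parts.port not in ALLOWED_PORTS:
        raise ValueError("port not allowed")
    return parts.hostname


def check_resolved_ip(ip_text: str) -> None:
    """Reject loopback, private, link-local, and other non-global addresses."""
    if not ipaddress.ip_address(ip_text).is_global:
        raise ValueError("address is not publicly routable")


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate streamed chunks and stop once the byte cap is exceeded."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError("body exceeds byte cap")
    return bytes(buf)
```

The byte cap is separate from any timeout, which matches the point above that a timeout does not bound the total download.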

Rendered HTML needs a narrow role. Playwright notes that page.goto() does not throw for normal HTTP 404 or 500 responses, and that networkidle is a discouraged wait state. Official pages often keep analytics, banners, live widgets, or tag scripts open. Horizon does not treat a quiet network as proof that official text loaded. It waits for source-relevant content or stable extracted text, then records the rendered route as a fallback route.

PDFs need page state. A legal or supervisory document may place the operative rule after a cover page, contents page, legal basis, and recital block. PyMuPDF warns that extracted text may not appear in natural reading order. Horizon therefore treats a PDF as a document with pages, headings, tables, dates, byte hashes, and page-level evidence. A same-URL PDF can change even when its link does not.
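
Page-level evidence can be sketched with per-page hashes. The sketch assumes that a PDF library has already extracted one text string per page. The function names and the whitespace normalization are assumptions.

```python
import hashlib
from typing import List


def page_hashes(pages: List[str]) -> List[str]:
    """Hash each page's text after collapsing runs of whitespace."""
    return [
        hashlib.sha256(" ".join(p.split()).encode("utf-8")).hexdigest()
        for p in pages
    ]


def changed_pages(old: List[str], new: List[str]) -> List[int]:
    """Return 1-based page numbers that differ, including added or removed pages."""
    old_h, new_h = page_hashes(old), page_hashes(new)
    longest = max(len(old_h), len(new_h))
    return [
        i + 1
        for i in range(longest)
        if i >= len(old_h) or i >= len(new_h) or old_h[i] != new_h[i]
    ]
```

The comparison is positional, so one inserted page marks every later page as changed. Matching pages by heading or content would be a natural refinement.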

Change-unit selection

The third contribution is change-unit selection. A raw page diff is too crude. Official change can occur as a new feed item, an altered Atom entry, a changed register row, a deleted warning, a replaced deadline, a new PDF, a removed PDF, or a modified paragraph. Horizon moves the unit of comparison from the whole page to the official artefact, and then to smaller units inside that artefact.

Deletion matters. A removed warning or withdrawn consultation can carry more legal weight than a new paragraph. A system that tracks only added text will miss that signal. Horizon records additions, removals, replacements, link changes, document changes, date changes, and status changes as distinct event types.
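
A unit-level diff that keeps removals visible can be sketched as follows. It assumes each official artefact has already been split into units keyed by a stable identifier, with a content hash per unit. The function name and the three event labels are assumptions that cover only part of the event list above.

```python
from typing import Dict, List, Tuple


def diff_units(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare unit hashes by identifier and emit (event_type, unit_id) pairs."""
    events: List[Tuple[str, str]] = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            events.append(("added", key))
        elif key not in new:
            events.append(("removed", key))  # a withdrawal is an event, not silence
        elif old[key] != new[key]:
            events.append(("replaced", key))
    return events
```

For example, a register that drops `warning-17` and adds `notice-9` yields one `removed` event and one `added` event, where a whole-page diff would show only changed text.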

The model as late classifier

The language model is not the retriever. It is a late classifier and writer. The retrieval layer builds route records, content records, document records, and change records first. The model then decides whether the change matters and writes a short explanation tied to source units. OpenAI's Structured Outputs can bind model text to a supplied JSON Schema, which reduces malformed field output, but refusals and incomplete outputs still need typed handling.
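
Typed handling of model outcomes can be sketched like this. The `status` strings and the two-field decision shape are hypothetical: they stand for a normalized form that an adapter would produce, not for any vendor's response format.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class Decision:
    material: bool
    summary: str


@dataclass
class Refused:
    reason: str


@dataclass
class Incomplete:
    detail: str


def parse_outcome(status: str, text: str) -> Union[Decision, Refused, Incomplete]:
    """Turn a model result into one of three typed outcomes."""
    if status == "refused":
        return Refused(reason=text)
    if status != "complete":
        return Incomplete(detail=f"status={status}")
    try:
        data = json.loads(text)
        material = data["material"]
        if not isinstance(material, bool):
            raise TypeError("material must be a boolean")
        return Decision(material=material, summary=str(data["summary"]))
    except (ValueError, KeyError, TypeError) as exc:
        return Incomplete(detail=f"unparseable output: {exc}")
```

Only a `Decision` can lead to an alert. The other two outcomes are recorded as decision states so that a refusal is never mistaken for "nothing changed".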

An alert should not be a free paragraph. It should carry a source unit: feed entry, API record, HTML block, register row, PDF page, or sitemap-derived URL after content verification. Each claim should point back to a route, a content hash, a timestamp, and a short quoted fragment. That creates a verifiable change record.
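
A claim bound to evidence can be sketched as a small record with a check. `Evidence` and `verify_evidence` are hypothetical names, and SHA-256 over UTF-8 text is an assumed hashing choice.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    route: str          # for example "feed" or "pdf_page"
    content_hash: str   # SHA-256 hex digest of the source unit's text
    retrieved_at: str   # timestamp string recorded at retrieval
    quote: str          # short fragment copied from the source unit


def verify_evidence(evidence: Evidence, unit_text: str) -> bool:
    """Check that the stored unit matches the hash and contains the quote."""
    digest = hashlib.sha256(unit_text.encode("utf-8")).hexdigest()
    return digest == evidence.content_hash and evidence.quote in unit_text
```

A reviewer who holds the stored source unit can rerun the check later. That is what makes the record verifiable and not merely descriptive.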

Telemetry and delivery

Telemetry needs the same discipline. OpenTelemetry says full URLs must not contain credentials and that sensitive query content should be scrubbed. Prometheus warns against labels with high cardinality, including user IDs and emails. Horizon records detailed run facts in an internal ledger, while public counters stay coarse: route type, source domain, run state, parser state, and decision state.
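
URL scrubbing before a value reaches telemetry can be sketched with `urllib.parse`. The function name is an assumption. Replacing the whole query string with a marker is also an assumption, and it is stricter than scrubbing only the sensitive parameters.

```python
from urllib.parse import urlsplit, urlunsplit


def scrub_url(url: str) -> str:
    """Drop credentials and fragment, and replace any query with a marker."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal needs its brackets restored
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    query = "REDACTED" if parts.query else ""
    return urlunsplit((parts.scheme, host, parts.path, query, ""))
```

For metric labels, the source domain alone is usually enough. The scrubbed URL belongs in the internal run record, not in a label.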

Callback delivery is also an evidence problem. GitHub signs webhook payloads with an HMAC over the payload and a secret. Horizon applies the same idea to outbound alert delivery: each delivery is signed and carries an event ID, a timestamp, and a body hash, and each attempt leaves a delivery record. The receiver can verify that the message came from Horizon and that the body did not change in transit.
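
HMAC signing of a delivery can be sketched with the standard library. The message layout (event ID, timestamp, and body joined by dots) and the function names are assumptions, not Horizon's wire format.

```python
import hashlib
import hmac


def sign(secret: bytes, event_id: str, timestamp: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 over the event ID, timestamp, and body."""
    message = event_id.encode("utf-8") + b"." + timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, event_id: str, timestamp: str,
                     body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, event_id, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Including the timestamp in the signed message lets a receiver reject old deliveries that are replayed. The receiver must verify the raw body bytes, before any JSON parsing or re-serialization.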

Four ledgers

The scientific value sits in the binding of four ledgers: route, representation, change, and decision. The route ledger says how the official object was reached. The representation ledger says what bytes or text were accepted. The change ledger says what unit changed. The decision ledger says why the system did or did not alert. Together they turn a pasted URL into a record that can be checked later.

Horizon's answer is therefore not "scrape harder." It is to reduce the pasted URL to a routing problem, read the best official artefact available, reject unsafe or degraded retrieval, compare the right unit, and attach each generated claim to source evidence. The design accepts a hard fact: official sources will change shape. The system should not hide that. It should name the route, name the state, and preserve the proof.
