From public URL to official record: source-route inference in Horizon

Horizon treats the URL as an observation, then tests which official route gives the cleanest record, the safest fetch path, and the strongest proof of change.

Illia Prokopiev, Co-Founder and CEO · 8 min read

Horizon is regulatory intelligence for fintech and crypto compliance. We watch official sources across EU regulators, six themes, and country-level national competent authorities (NCAs), and turn substantive changes into structured, plain-English alerts grounded in source quotes and extracted obligations, delivered via email, Slack, Telegram, webhook, or REST API. Here I explain how it works.

Official publications no longer live in one place. A government notice may appear as a web page, an API record, a feed entry, a PDF, a register row, a legal identifier, or a sitemap URL. A pasted URL points to one surface. It rarely proves that the surface is the best route to the official record.

Horizon treats the pasted URL as an observation, not as the source itself. The system asks a stricter question: which official route gives the cleanest record, the least page noise, the safest fetch path, and the strongest proof of change?

We call that task source-route inference. A URL enters the system. The system tests the available routes. It then binds the retrieved object to a change record. The object may be a notice, filing, consultation, register entry, warning, law text, speech, data release, or document section.

The route hierarchy

Official publishers already expose machine routes. GOV.UK's search endpoint is public and rejects unknown or invalid parameters with HTTP 422. SEC's EDGAR APIs serve JSON without API keys and update filing data during the day. The EU Publications Office's Cellar exposes REST, SPARQL, RSS, and Atom around structured identifiers for EU publications. The FCA exposes a Register API and notes that the free service carries no uptime or issue-resolution promise. These facts show why an official-source system must search for route quality before it reads visible page text.

The route order is not a scraper trick. It is a way to respect the publisher's own structure. Native APIs come first when they exist and fit the task. Feeds come next because they expose item-level records. Sitemaps help find new or changed URLs. Registers and legal stores carry stable identifiers. Static HTML and rendered HTML remain fallback routes. PDFs need document-level handling, not a short text prefix.
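The ordering above can be sketched as a ranked enum plus a "fits the task" flag. This is a minimal illustration, not Horizon's implementation: the names Route, Candidate, and pick_route are hypothetical, and the relative rank of registers and sitemaps is an assumption, since the text does not fix it.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class Route(IntEnum):
    # Lower value means more preferred. The exact rank of REGISTER
    # versus SITEMAP is an assumption made for this sketch.
    NATIVE_API = 1
    FEED = 2
    REGISTER = 3
    SITEMAP = 4
    STATIC_HTML = 5
    RENDERED_HTML = 6


@dataclass(frozen=True)
class Candidate:
    route: Route
    url: str
    fits_task: bool  # an API that lacks the needed record does not fit


def pick_route(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the most preferred candidate that fits the task, or None."""
    usable = [c for c in candidates if c.fits_task]
    return min(usable, key=lambda c: c.route, default=None)
```

A native API that exists but does not expose the needed record is skipped, so a feed or register can still win over it.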

Feeds are not all equal. RSS treats guid as optional and leaves its syntax to the feed source. Atom requires an atom:id, treats it as a permanent identifier, and compares it character by character. RFC 5005 also defines paged feeds, where one feed URL may not contain the whole logical feed. Horizon therefore treats feed identity as source-specific state, not as a universal hash rule.
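A small sketch of source-specific feed identity, using only the standard library. The function name entry_keys and the fallback rule for RSS items without a guid (a hash of link and title) are assumptions for illustration; a real system would choose that fallback per source.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def entry_keys(feed_xml: str) -> list[tuple[str, str]]:
    """Return (identity kind, identity value) for each feed entry."""
    root = ET.fromstring(feed_xml)
    keys: list[tuple[str, str]] = []
    if root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            # atom:id is required and compared character by character,
            # so it is kept exactly as published (no trimming, no case folding).
            keys.append(("atom:id", entry.findtext(ATOM + "id", default="")))
    else:
        for item in root.iter("item"):
            guid = item.findtext("guid")
            if guid:
                keys.append(("guid", guid))
            else:
                # guid is optional in RSS. This fallback is an assumption.
                basis = (item.findtext("link") or "") + "\n" + (item.findtext("title") or "")
                digest = hashlib.sha256(basis.encode("utf-8")).hexdigest()
                keys.append(("fallback-sha256", digest))
    return keys
```

Keeping the identity kind next to the value lets the ledger show which entries rest on a publisher identifier and which rest on a local fallback.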

Sitemaps solve a different problem. They list URLs and optional update dates. A sitemap index can show when a sitemap file changed, not when every page inside changed. The sitemap protocol also caps a sitemap at 50,000 URLs and 50 MB uncompressed. Horizon reads sitemap dates as retrieval hints. It still verifies page or document content before it records a change.
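Reading sitemap dates as hints might look like the sketch below. The function name sitemap_hints and the shape of the seen mapping (URL to last observed lastmod string) are hypothetical. The output is a list of URLs to refetch and verify, never a list of confirmed changes.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000  # per-file cap from the sitemap protocol


def sitemap_hints(xml_text: str, seen: dict[str, Optional[str]]) -> list[str]:
    """Return URLs worth refetching; content is still verified afterwards."""
    root = ET.fromstring(xml_text)
    urls = root.findall(SM + "url")
    if len(urls) > MAX_URLS:
        raise ValueError("sitemap exceeds the protocol's URL cap")
    hints = []
    for node in urls:
        loc = (node.findtext(SM + "loc") or "").strip()
        raw = node.findtext(SM + "lastmod")
        lastmod = raw.strip() if raw else None
        if not loc:
            continue
        # New URL, missing date, or a date that moved: all are only hints.
        if loc not in seen or lastmod is None or seen[loc] != lastmod:
            hints.append(loc)
    return hints
```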

HTML feed discovery has a precise signal. The HTML standard says link rel="alternate" with type="application/rss+xml" or type="application/atom+xml" identifies feeds for discovery. Body links do not carry the same signal. Horizon uses that distinction to separate a publisher's machine path from ordinary page links.
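The discovery rule can be sketched with the standard-library HTML parser. The class and function names are hypothetical; the sketch only inspects link elements, so an ordinary anchor pointing at a feed file is ignored.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> elements with a feed type."""

    def __init__(self) -> None:
        super().__init__()
        self.feeds: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return  # body anchors do not carry the discovery signal
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        if "alternate" in rels and "stylesheet" not in rels and mime in FEED_TYPES:
            if attr.get("href"):
                self.feeds.append(attr["href"])


def discover_feeds(html: str) -> list[str]:
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds
```

The returned href values may be relative, so they still need to be resolved and safety-checked before any fetch.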

HTTP also carries change signals. RFC 9110 defines entity tags and modification dates for conditional requests. If-None-Match lets a client ask whether a stored entity tag still matches. If-Modified-Since lets a client avoid transfer when the selected representation has not changed. Horizon keeps these signals beside its own content hashes because server validators and local hashes answer different questions.
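The difference between server validators and local hashes can be shown without any network call. In this sketch the Stored record and the state labels are hypothetical; the point is that a 200 response with identical bytes is a different fact from a 304.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stored:
    etag: Optional[str]
    last_modified: Optional[str]
    body_sha256: Optional[str]


def conditional_headers(stored: Stored) -> dict[str, str]:
    """Build request headers from the validators kept for the last accepted body."""
    headers = {}
    if stored.etag:
        headers["If-None-Match"] = stored.etag
    if stored.last_modified:
        headers["If-Modified-Since"] = stored.last_modified
    return headers


def classify(status: int, body: bytes, stored: Stored) -> str:
    if status == 304:
        return "not-modified"  # the server says its validator still matches
    if status == 200:
        digest = hashlib.sha256(body).hexdigest()
        # The server sent a body; only the local hash says whether bytes differ.
        return "same-bytes" if digest == stored.body_sha256 else "changed-bytes"
    return "transport-error"
```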

The accepted-source ledger

A page fetch can succeed and still fail as retrieval. The server may return a cookie shell, a search shell, a localized view, a blocked-bot page, a stale cache, or a document list without the document body. A system that records only "HTTP 200" cannot tell these cases apart. Horizon separates transport state, extraction state, content state, and decision state.

That separation creates the second contribution: the accepted-source ledger. A retrieved body does not become the next baseline merely because it arrived. It must pass route checks, content checks, extraction checks, and decision checks. If extraction collapses or a model call fails, the system records the run but does not let the broken body replace the prior accepted record.

The accepted-source ledger changes the meaning of silence. No alert can mean that the official route did not change. It can also mean that robots rules blocked the path, the fetch failed, extraction looked unsafe, the PDF parser failed, the model refused the decision, or the change fell outside the user's selected scope. Horizon makes these states distinct.
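A minimal sketch of the acceptance rule described in the last two paragraphs. The State values, the Run fields, and the Ledger class are hypothetical names; the behaviour shown is only that every run is recorded while the baseline moves only when all four states pass.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OK = "ok"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class Run:
    transport: State
    extraction: State
    content: State
    decision: State
    body_sha256: Optional[str] = None
    note: str = ""  # e.g. "pdf parser failed", "model refused"


class Ledger:
    def __init__(self) -> None:
        self.runs: list[Run] = []
        self.baseline: Optional[Run] = None

    def record(self, run: Run) -> bool:
        """Always record the run; promote it to baseline only if every state is OK."""
        self.runs.append(run)
        states = (run.transport, run.extraction, run.content, run.decision)
        if all(state is State.OK for state in states):
            self.baseline = run
            return True
        return False
```

Because failed runs stay in the ledger with their note, a quiet period can later be traced to "no change" or to a named failure.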

Robots.txt belongs in the ledger, but it is not proof of authority. RFC 9309 says crawler rules are requests that crawlers honor, not access authorization. Horizon records and respects crawler policy. It still uses separate checks for source trust, user safety, and retrieval quality.
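Keeping crawler policy and source trust as separate ledger fields could look like this. The function name, the dictionary keys, and the trusted_hosts allow-list are assumptions; the robots check uses the standard-library parser on text that was already fetched.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def policy_entry(robots_txt: str, agent: str, url: str, trusted_hosts: set[str]) -> dict:
    """Record crawler policy and source trust as two independent facts."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    host = urlsplit(url).hostname
    return {
        # What the publisher asks crawlers to do.
        "crawler_policy": "allowed" if parser.can_fetch(agent, url) else "disallowed",
        # Whether the host is one we accept as an official source at all.
        "source_trusted": host in trusted_hosts,
    }
```

A permissive robots.txt on an untrusted host yields "allowed" and False, which is the case the paragraph warns about.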

Retrieval safety as a first-class concern

URL safety is part of retrieval science here, not a later hardening step. OWASP names mishandled URLs and custom webhooks as server-side request forgery triggers. Python's URL tools also warn that parsing does not validate input and that urljoin can replace a trusted host when the second value is absolute. Horizon treats monitor URLs, document links, feed links, sitemap links, rendered-page navigation, and callback URLs as outbound execution surfaces.
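The urljoin behaviour is easy to reproduce. In the sketch below, resolve_link and the allowed_hosts set are hypothetical; the host is checked on the joined result, never on the base alone.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve_link(base: str, href: str, allowed_hosts: set[str]) -> Optional[str]:
    """Join href to base, then re-check scheme and host on the result."""
    target = urljoin(base, href)
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.hostname not in allowed_hosts:
        return None
    return target


base = "https://regulator.example/news/"
# An absolute or scheme-relative href replaces the trusted host entirely.
absolute = urljoin(base, "https://evil.example/x")
scheme_relative = urljoin(base, "//evil.example/x")
```

Here both absolute and scheme_relative point at evil.example, while resolve_link returns None for them and a full URL for a relative href such as "doc.pdf".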

A safe retrieval path checks the scheme, host, port, DNS result, IP range, redirect chain, content type, and byte size before it trusts the response. Requests follows redirects by default, so a system that validates only the first URL can still fetch a forbidden target after a redirect. Requests also states that its timeout is not a total download limit. Horizon therefore treats timeout, byte count, redirect count, and decoded body size as separate controls.
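One way to sketch per-hop validation is to take redirects out of the HTTP client's hands and re-check every hop. Everything named here is hypothetical: fetch_once is an injected callable returning (status, Location), resolve is an injected DNS lookup, and the policy (HTTPS only, port 443, no private or reserved addresses) is an assumption. A production client would also need to connect to the validated address, which this sketch does not cover.

```python
import ipaddress
from urllib.parse import urljoin, urlsplit


class UnsafeTarget(Exception):
    pass


def ip_allowed(ip_text: str) -> bool:
    ip = ipaddress.ip_address(ip_text)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def check_hop(url: str, resolve) -> None:
    """Validate scheme, port, host, and every resolved address for one hop."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise UnsafeTarget(f"scheme not allowed: {parts.scheme!r}")
    try:
        port = parts.port
    except ValueError as exc:
        raise UnsafeTarget("invalid port") from exc
    if port not in (None, 443):
        raise UnsafeTarget(f"port not allowed: {port}")
    if not parts.hostname:
        raise UnsafeTarget("missing host")
    addresses = resolve(parts.hostname)
    if not addresses or not all(ip_allowed(a) for a in addresses):
        raise UnsafeTarget(f"address not allowed for {parts.hostname}")


def follow(url: str, fetch_once, resolve, max_redirects: int = 5):
    """Follow redirects manually so each hop is checked before it is fetched."""
    for _ in range(max_redirects + 1):
        check_hop(url, resolve)
        status, location = fetch_once(url)
        if status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)
            continue
        return url, status
    raise UnsafeTarget("too many redirects")


def read_capped(chunks, max_bytes: int) -> bytes:
    """Stop reading once the body exceeds max_bytes; a timeout alone does not do this."""
    out = bytearray()
    for chunk in chunks:
        out += chunk
        if len(out) > max_bytes:
            raise UnsafeTarget("body exceeds byte cap")
    return bytes(out)
```

With Requests, the equivalent of fetch_once would disable automatic redirects and stream the body, so that redirect count and byte count stay under the caller's control.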

Rendered HTML needs a narrow role. Playwright notes that page.goto() does not throw for normal HTTP 404 or 500 responses, and that networkidle is a discouraged wait state. Official pages often keep analytics, banners, live widgets, or tag scripts open. Horizon does not treat a quiet network as proof that official text loaded. It waits for source-relevant content or stable extracted text, then records the rendered route as a fallback route.
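Waiting for stable extracted text, instead of a quiet network, can be sketched independently of any browser library. The function below is hypothetical: sample is any callable that returns the current extracted text (with Playwright it might wrap a call that reads the text of the main content element), and the thresholds are illustrative.

```python
from typing import Callable, Optional


def wait_for_stable_text(
    sample: Callable[[], str],
    min_chars: int = 200,
    stable_reads: int = 3,
    max_reads: int = 20,
    pause: Callable[[], None] = lambda: None,
) -> Optional[str]:
    """Return text once it is long enough and identical across consecutive reads."""
    last = None
    streak = 0
    for _ in range(max_reads):
        text = sample()
        if len(text) >= min_chars and text == last:
            streak += 1
            if streak >= stable_reads - 1:
                return text
        else:
            streak = 0
        last = text
        pause()  # e.g. a short sleep between reads
    return None  # never stabilised; record the run, do not accept the body
```

The HTTP status of the navigation response still needs its own check, since a rendered 404 page can also produce long, stable text.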

PDFs need page state. A legal or supervisory document may place the operative rule after a cover page, contents page, legal basis, and recital block. PyMuPDF warns that extracted text may not appear in natural reading order. Horizon therefore treats a PDF as a document with pages, headings, tables, dates, byte hashes, and page-level evidence. A same-URL PDF can change even when its link does not.
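Page state can be illustrated with plain hashes. The sketch assumes the per-page text has already been extracted by a PDF library; that step is left out on purpose, and the names fingerprint and compare are hypothetical.

```python
import hashlib


def fingerprint(pdf_bytes: bytes, page_texts: list[str]) -> dict:
    """Keep one hash for the file and one hash per page of extracted text."""
    return {
        "bytes": hashlib.sha256(pdf_bytes).hexdigest(),
        "pages": [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in page_texts],
    }


def compare(old: dict, new: dict) -> tuple[bool, list[int]]:
    """Return (file bytes changed, 1-based page numbers whose text changed)."""
    bytes_changed = old["bytes"] != new["bytes"]
    a, b = old["pages"], new["pages"]
    changed = [
        i + 1
        for i in range(max(len(a), len(b)))
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    ]
    return bytes_changed, changed
```

A result of (True, []) means the file changed while the extracted text did not, for example after a metadata edit, which is a different event from a changed operative page.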

Change-unit selection

The third contribution is change-unit selection. A raw page diff is too crude. Official change can occur as a new feed item, an altered Atom entry, a changed register row, a deleted warning, a replaced deadline, a new PDF, a removed PDF, or a modified paragraph. Horizon moves the unit of comparison from the whole page to the official artefact, and then to smaller units inside that artefact.

Deletion matters. A removed warning or withdrawn consultation can carry more legal weight than a new paragraph. A system that tracks only added text will miss that signal. Horizon records additions, removals, replacements, link changes, document changes, date changes, and status changes as distinct event types.
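Comparing units instead of whole pages can be sketched as a keyed diff. The Event type and the three kinds shown are a simplified, hypothetical subset of the event types listed above; each mapping goes from a unit identifier (feed entry id, register row key, paragraph anchor) to a content hash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str  # "added", "removed", or "replaced"
    unit_id: str


def diff_units(before: dict[str, str], after: dict[str, str]) -> list[Event]:
    """Compare two snapshots of unit_id -> content hash."""
    events = []
    for unit_id in sorted(before.keys() | after.keys()):
        if unit_id not in before:
            events.append(Event("added", unit_id))
        elif unit_id not in after:
            # A withdrawn warning surfaces here; an additions-only diff would miss it.
            events.append(Event("removed", unit_id))
        elif before[unit_id] != after[unit_id]:
            events.append(Event("replaced", unit_id))
    return events
```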

The model as late classifier

The language model is not the retriever. It is a late classifier and writer. The retrieval layer builds route records, content records, document records, and change records first. The model then decides whether the change matters and writes a short explanation tied to source units. OpenAI's Structured Outputs can bind model text to a supplied JSON Schema, which reduces malformed field output, but refusals and incomplete outputs still need typed handling.

An alert should not be a free paragraph. It should carry a source unit: feed entry, API record, HTML block, register row, PDF page, or sitemap-derived URL after content verification. Each claim should point back to a route, a content hash, a timestamp, and a short quoted fragment. That creates a verifiable change record.
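A verifiable claim can be modelled as a small record plus a check. The field names and verify_claim are hypothetical; the check confirms that the stored hash matches the accepted source text and that the quoted fragment really occurs in it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str            # the generated statement shown to the user
    route: str           # e.g. "feed", "api", "pdf-page"
    unit_id: str         # which source unit the claim points at
    content_sha256: str  # hash of the accepted source text
    retrieved_at: str    # ISO 8601 timestamp of the retrieval
    quote: str           # short fragment copied from the source


def verify_claim(claim: Claim, source_text: str) -> bool:
    """True only if the hash matches and the quote is present verbatim."""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return digest == claim.content_sha256 and claim.quote in source_text
```

A claim whose quote was paraphrased by the model fails this check, which is the intended outcome.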

Telemetry and delivery

Telemetry needs the same discipline. OpenTelemetry says full URLs must not contain credentials and that sensitive query content should be scrubbed. Prometheus warns against labels with high cardinality, including user IDs and emails. Horizon records detailed run facts in an internal ledger, while public counters stay coarse: route type, source domain, run state, parser state, and decision state.
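URL scrubbing before telemetry can be sketched with the standard library. The function name scrub_url and the choice to redact every query value (not only known-sensitive keys) are conservative assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def scrub_url(url: str) -> str:
    """Drop credentials and fragment, and redact all query values."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets for IPv6 literals
    if parts.port:
        host = f"{host}:{parts.port}"
    query = urlencode(
        [(key, "REDACTED") for key, _ in parse_qsl(parts.query, keep_blank_values=True)]
    )
    return urlunsplit((parts.scheme, host, parts.path, query, ""))
```

The scrubbed URL belongs in the internal ledger; metric labels would carry only the coarse fields named above, such as route type and source domain.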

Callback delivery is also an evidence problem. GitHub signs webhook payloads with an HMAC-SHA256 digest of the payload, keyed with a shared secret. Horizon applies the same idea to outbound alert delivery: event ID, timestamp, body hash, and delivery record. The receiver can verify that the message came from Horizon and that the body did not change in transit.
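A sketch of signed delivery in the same spirit. The signed string layout (event ID, timestamp, and body hash joined by dots) and the function names are assumptions for illustration, not Horizon's documented wire format.

```python
import hashlib
import hmac


def sign(secret: bytes, event_id: str, timestamp: str, body: bytes) -> str:
    """HMAC-SHA256 over event ID, timestamp, and the hash of the body."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{event_id}.{timestamp}.{body_hash}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, event_id: str, timestamp: str, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(secret, event_id, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Including the timestamp in the signed string lets a receiver also reject old deliveries that are replayed with a valid signature.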

Four ledgers

The scientific value sits in the binding of four ledgers: route, representation, change, and decision. The route ledger says how the official object was reached. The representation ledger says what bytes or text were accepted. The change ledger says what unit changed. The decision ledger says why the system did or did not alert. Together they turn a pasted URL into a record that can be checked later.

Horizon's answer is therefore not "scrape harder." It is to reduce the pasted URL to a routing problem, read the best official artefact available, reject unsafe or degraded retrieval, compare the right unit, and attach each generated claim to source evidence. The design accepts a hard fact: official sources will change shape. The system should not hide that. It should name the route, name the state, and preserve the proof.

Written by

Illia Prokopiev

Co-Founder and CEO

Illia is the Managing Partner and founder of Licentium. With over 11 years of practice, he has guided innovators through cross-border M&A deals and the disputes that follow, combining transactional skill with courtroom resolve. Admitted to the bar in 2017, he pivoted early to Web3, serving as legal advisor to prominent crypto projects and carrying AML/MLRO duties that anchored complex token, DAO, and compliance questions on solid regulatory ground. Certified in money laundering prevention and an active crypto investor, Illia blends market intuition with a global network of specialists, enabling Licentium to untangle licensing knots for crypto and AI ventures anywhere in the world.
