Artifacts

Methodology, corpus, reproducibility.

Benchmark methodology

LawVM benchmarks compare replayed output against real-world publication surfaces. For Finland, that means comparing replayed point-in-time text against the Finlex editorial consolidation.

The core rule: Benchmark scores are a proxy. Manual residual review against primary sources is the real verification loop. High similarity does not mean replay is correct. Low similarity does not mean replay is wrong. Divergence type matters.

Two metrics:

  • Levenshtein text distance — character-level normalized edit distance. Mean: 0.65%.
  • Structural section error — section-level structural divergence. Mean: 4.25%.
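
To make the first metric concrete, here is a minimal sketch of a character-level normalized edit distance. This is an illustration, not the project's actual implementation; the function names and the choice to normalize by the longer text are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(replayed: str, oracle: str) -> float:
    """Edit distance normalized by the longer text, in [0, 1]."""
    if not replayed and not oracle:
        return 0.0
    return levenshtein(replayed, oracle) / max(len(replayed), len(oracle))
```

A mean of 0.65% on this scale means that, on average, fewer than 7 characters per thousand differ between the replayed text and the oracle.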

Some divergences mean LawVM is right and Finlex is wrong. The residual taxonomy (15 root cause categories) classifies each mismatch so that evaluation is not a single number but a typed evidence surface.

Corpus definition

690 statutes curated from 3,591 amended Finnish statutes. Curation criteria (all structural, no temporal filtering):

  1. Base statute XML exists in the archived source corpus
  2. XML is parseable and contains section structure
  3. Oracle consolidated XML exists with non-empty body
  4. All amendment texts available in the archive
  5. At least one amendment

Decade span: 1920s–2020s. Amendment counts per statute: 1 to 238. Not hand-picked for success — curated for replayability. The curation script is scripts/curate_corpus.py.
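
The five criteria above amount to a conjunctive filter over the archive. A minimal sketch follows; the `Statute` record and its field names are assumptions for illustration, not the actual schema used by scripts/curate_corpus.py.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Statute:
    base_xml: Optional[str]            # base statute XML, if archived
    has_section_structure: bool        # XML parsed and contains sections
    oracle_body: Optional[str]         # consolidated oracle XML body
    amendments: list = field(default_factory=list)
    missing_amendments: int = 0        # amendment texts absent from archive

def replayable(s: Statute) -> bool:
    """All five curation criteria; purely structural, no temporal filter."""
    return (s.base_xml is not None          # 1. base statute XML exists
            and s.has_section_structure     # 2. parseable, has sections
            and bool(s.oracle_body)         # 3. non-empty oracle body
            and s.missing_amendments == 0   # 4. all amendments archived
            and len(s.amendments) >= 1)     # 5. at least one amendment
```

Applied to the full set of 3,591 amended statutes, a filter of this shape yields the 690-statute corpus.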

Current benchmark snapshot

Metric                        Value
Statutes                      690
Mean Levenshtein distance     0.65%
Mean structural error         4.25%
Perfect text match            ~420
Perfect structural match      367
≥95% structural match         490
<90% structural match         104

Run: 2026-04-16, mode: finlex_oracle.

Golden dataset

77+ verified divergence entries (as of 2026-04-16, growing). Each entry documents: statute ID, title, verdict, root cause, Finnish prose summary, affected sections. Format: one YAML file per statute. Schema in notes/verified_finlex_errors/README.md.

Verdicts: lawvm_ok (Finlex is wrong), mixed (both have issues), source_defect (source material broken), lawvm_bug (LawVM is wrong).
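
An entry in this format might look like the following sketch. Every field value here is an illustrative placeholder, and the field names are guesses from the list above; the authoritative schema is in notes/verified_finlex_errors/README.md.

```yaml
# Hypothetical entry, not from the actual golden dataset.
statute_id: "1999/000"            # placeholder statute ID
title: "Esimerkkilaki"            # "Example act" (placeholder)
verdict: lawvm_ok                 # lawvm_ok | mixed | source_defect | lawvm_bug
root_cause: example_category      # one of the 15 residual taxonomy categories
summary_fi: >
  Finnish prose summary of the divergence goes here.
affected_sections: ["3 §", "5 §"]
```

One file per statute keeps the dataset diffable and lets entries be reviewed independently as the collection grows.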

Reproducibility

uv sync
uv run lawvm bench --mode finlex_oracle --label reproduce

Replays all 690 statutes and reports metrics. Requires data/finlex.farchive (the Finnish source and oracle archive). Results depend on archive contents at the time of the run, since oracle consolidation surfaces change as Finlex editors update them; benchmarks against a frozen archive are stable.

The source archive (finlex.farchive) is built from Finlex open data batch downloads. The acquisition scripts and benchmark tooling are in the repository.

Downloads

Artifact releases (Zenodo DOI-backed) are planned for:

  • Frozen corpus snapshot
  • Software release archive
  • Golden dataset export
  • Publication database (SQLite)

Status: Pending corpus freeze. Links will appear here when available.