Artifacts

Methodology, corpus, reproducibility.

Benchmark methodology

LawVM benchmarks compare replayed output against real-world publication surfaces. For Finland, that means comparing replayed point-in-time text against the Finlex editorial consolidation.

The core rule: Benchmark scores are a proxy. Manual residual review against primary sources is the real verification loop. High similarity does not mean replay is correct. Low similarity does not mean replay is wrong. Divergence type matters.

Two metrics:

  • Levenshtein text distance — character-level normalized edit distance. Mean: 0.65%.
  • Structural section error — section-level structural divergence. Mean: 4.25%.
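
To make the first metric concrete, here is a minimal sketch of a character-level normalized edit distance. This is an illustration, not the project's actual implementation; the function names and the choice to normalize by the longer text are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(replayed: str, oracle: str) -> float:
    """Edit distance normalized by the longer text, in [0, 1]."""
    if not replayed and not oracle:
        return 0.0
    return levenshtein(replayed, oracle) / max(len(replayed), len(oracle))
```

A mean of 0.65% on this scale means that, on average, fewer than 7 characters per thousand differ between the replayed text and the oracle.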

Some divergences mean LawVM is right and Finlex is wrong. The residual taxonomy (15 root cause categories) classifies each mismatch so that evaluation is not a single number but a typed evidence surface.

Corpus definition

690 statutes curated from 3,591 amended Finnish statutes. Curation criteria (all structural, no temporal filtering):

  1. Base statute XML exists in the archived source corpus
  2. XML is parseable and contains section structure
  3. Oracle consolidated XML exists with non-empty body
  4. All amendment texts available in the archive
  5. At least one amendment

Decade span: 1920s–2020s. Amendment counts per statute: 1 to 238. Not hand-picked for success — curated for replayability. The curation script is scripts/curate_corpus.py.
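
The five criteria above amount to a conjunctive filter over the archive. A minimal sketch follows; the `Statute` record and its field names are assumptions for illustration, not the actual schema used by scripts/curate_corpus.py.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Statute:
    base_xml: Optional[str]            # base statute XML, if archived
    has_section_structure: bool        # XML parsed and contains sections
    oracle_body: Optional[str]         # consolidated oracle XML body
    amendments: list = field(default_factory=list)
    missing_amendments: int = 0        # amendment texts absent from archive

def replayable(s: Statute) -> bool:
    """All five curation criteria; purely structural, no temporal filter."""
    return (s.base_xml is not None          # 1. base statute XML exists
            and s.has_section_structure     # 2. parseable, has sections
            and bool(s.oracle_body)         # 3. non-empty oracle body
            and s.missing_amendments == 0   # 4. all amendments archived
            and len(s.amendments) >= 1)     # 5. at least one amendment
```

Applied to the full set of 3,591 amended statutes, a filter of this shape yields the 690-statute corpus.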

Current benchmark snapshot

Metric                        Value
Statutes                      690
Mean Levenshtein distance     0.65%
Mean structural error         4.25%
Perfect text match            ~420
Perfect structural match      367
≥95% structural match         490
<90% structural match         104

Run: 2026-04-16, mode: finlex_oracle.

Golden dataset

77+ verified divergence entries (as of 2026-04-16, growing). Each entry documents: statute ID, title, verdict, root cause, Finnish prose summary, affected sections. Format: one YAML file per statute. Schema in notes/verified_finlex_errors/README.md.

Verdicts: lawvm_ok (Finlex is wrong), mixed (both have issues), source_defect (source material broken), lawvm_bug (LawVM is wrong).
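
An entry in this format might look like the following sketch. Every field value here is an illustrative placeholder, and the field names are guesses from the list above; the authoritative schema is in notes/verified_finlex_errors/README.md.

```yaml
# Hypothetical entry, not from the actual golden dataset.
statute_id: "1999/000"            # placeholder statute ID
title: "Esimerkkilaki"            # "Example act" (placeholder)
verdict: lawvm_ok                 # lawvm_ok | mixed | source_defect | lawvm_bug
root_cause: example_category      # one of the 15 residual taxonomy categories
summary_fi: >
  Finnish prose summary of the divergence goes here.
affected_sections: ["3 §", "5 §"]
```

One file per statute keeps the dataset diffable and lets entries be reviewed independently as the collection grows.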

Reproducibility

uv sync
uv run lawvm bench --mode finlex_oracle --label reproduce

Replays all 690 statutes and reports metrics. Requires data/finlex.farchive (the Finnish source and oracle archive). Results depend on archive contents at the time of the run, since oracle consolidation surfaces change as Finlex editors update them; benchmarks against a frozen archive are stable.

The source archive (finlex.farchive) is built from Finlex open data batch downloads. The acquisition scripts and benchmark tooling are in the repository.

Downloads

Artifact releases (Zenodo DOI-backed) are planned for:

  • Frozen corpus snapshot
  • Software release archive
  • Golden dataset export
  • Publication database (SQLite)

Status: Pending corpus freeze. Links will appear here when available.