Platform score: 0.480

Shafi

Evidence-first legal QA over DIFC regulations. Built in 13 days for the Agentic RAG Legal Challenge 2026 (350+ teams). The internal estimate was ~0.92; the platform returned 0.48. That ~44-point gap from distribution shift is the interesting part.

By the numbers

Raw measurements from 13 days of competition. The internal G score was 0.9956 on the public trial set. The platform returned Total = 0.480 on the private set.

1,931 ms
Time to first token (TTFT) after prompt caching
Down from 5,086 ms. Prompt caching was the single largest improvement to any metric. Standard retrieval techniques (BM25, HyDE, RAG Fusion) all failed to move the score.
16
Experiments run
87.5% rejection rate (14 of 16). Only 2 experiments produced measurable improvements. Negative results documented in the article.
172
Python source files
~65,500 lines of code. Up from 49 files on Day 1. Pipeline, retrieval, grounding, ML, generation, and ingestion subsystems.
1,595
Commits in 13 days
March 10-22, 2026. 12 AI agents running concurrently via Claude Code CLI, pulling tasks from a shared FastAPI server (sketched below these figures). 737 tickets processed.
12
AI agents
Claude Opus 4.6 instances coordinated via Bernstein orchestrator. Final 47 hours: 737 tickets across all 12 agents.
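
For concreteness, a minimal sketch of a shared task server of this shape. The endpoints, Ticket model, and in-memory queue are illustrative assumptions, not the actual Bernstein orchestrator API.

    # Minimal shared task server for agent workers (sketch; endpoints,
    # Ticket model, and in-memory queue are illustrative, not the real
    # Bernstein orchestrator API).
    from collections import deque

    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()
    queue: deque = deque()            # pending tickets
    claimed: dict[str, dict] = {}     # ticket_id -> claiming agent + ticket

    class Ticket(BaseModel):
        ticket_id: str
        description: str

    @app.post("/tickets")
    def enqueue(ticket: Ticket):
        queue.append(ticket.model_dump())
        return {"queued": len(queue)}

    @app.post("/tickets/claim")
    def claim(agent: str):
        # Each agent polls this endpoint; tickets are handed out FIFO.
        if not queue:
            raise HTTPException(status_code=404, detail="no pending tickets")
        ticket = queue.popleft()
        claimed[ticket["ticket_id"]] = {"agent": agent, **ticket}
        return ticket

    @app.post("/tickets/{ticket_id}/done")
    def complete(ticket_id: str):
        claimed.pop(ticket_id, None)
        return {"remaining": len(queue)}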

What failed, what worked

16 experiments, 87.5% rejection rate. Every standard RAG retrieval improvement failed. Infrastructure optimization beat all retrieval techniques.

Rejected

BM25 standalone retrieval

Added latency without grounding gain. Failed telemetry gating on the validation slice.
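
For reference, standalone lexical BM25 of the kind gated out here can be sketched with the rank_bm25 library; the corpus and query below are illustrative.

    # Standalone BM25 retrieval (sketch; corpus and query illustrative).
    from rank_bm25 import BM25Okapi

    corpus = [
        "A DIFC entity must maintain a registered office in the DIFC.",
        "The Registrar may strike a company off the register.",
        "Employees accrue vacation leave at a minimum statutory rate.",
    ]
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    query = "minimum vacation leave entitlement".lower().split()
    top_docs = bm25.get_top_n(query, corpus, n=2)  # ranked by BM25 score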

Rejected

RAG Fusion

Multiple query reformulations. Increased latency, no measurable improvement to answer quality or grounding score.
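
RAG Fusion issues several reformulations of the question, retrieves for each, and merges the rankings, typically with reciprocal rank fusion (RRF). A minimal sketch of the merge step; k=60 is the conventional RRF constant.

    # Reciprocal rank fusion over rankings from multiple query
    # reformulations (sketch; document ids illustrative).
    from collections import defaultdict

    def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Each reformulated query yields its own ranking; RRF merges them.
    merged = rrf_merge([["d3", "d1", "d7"], ["d1", "d3", "d9"]])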

Rejected

HyDE (Hypothetical Document Embeddings)

Generated hypothetical answers to improve retrieval. Caused answer quality regression on the validation set.
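
HyDE retrieves with the embedding of a generated hypothetical answer instead of the raw question. A sketch assuming an OpenAI-style client and a stand-in vector_search function; the model names are illustrative (the real pipeline embeds with Kanon-2).

    # HyDE: embed a hypothetical answer, not the question (sketch;
    # `client` is an OpenAI-style client, `vector_search` a stand-in).
    def hyde_retrieve(question: str, client, vector_search):
        draft = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                "content": f"Write a short passage answering: {question}",
            }],
        ).choices[0].message.content
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=draft
        ).data[0].embedding
        return vector_search(emb)  # retrieve by the hypothetical answer's vector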

Rejected

Step-back prompting

Abstracting to a broader question before retrieval. No score improvement, added an extra LLM call per question.
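
Step-back prompting first abstracts the question, then retrieves for both forms; the abstraction is the extra LLM call noted above. A sketch with stand-in client and retrieve helpers.

    # Step-back prompting (sketch; `client` and `retrieve` are stand-ins).
    def step_back_retrieve(question: str, client, retrieve):
        broader = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                "content": f"Rephrase as a more general question: {question}",
            }],
        ).choices[0].message.content  # the extra per-question LLM call
        return retrieve(question) + retrieve(broader)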

Worked

Prompt caching

TTFT dropped from 5,086 ms to 1,931 ms. The single largest improvement to any metric across all 16 experiments. Infrastructure beat sophistication.
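
The win comes from prompt structure rather than model changes. Assuming OpenAI-style automatic prefix caching (the generation models here are GPT-4.1 family), a cache only hits when the leading tokens are identical across requests, so the large static instructions go first and the per-question text last. A sketch with illustrative prompt content; on OpenAI's API, hits show up as cached prompt tokens in the response usage.

    # Order prompts so the long static prefix is cache-friendly (sketch;
    # prompt content illustrative). OpenAI-style prefix caching is
    # automatic when leading tokens match across requests.
    STATIC_SYSTEM = (
        "You answer questions about DIFC regulations strictly from the "
        "provided evidence..."  # long and identical for every request
    )

    def answer(client, question: str, evidence: str):
        return client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": STATIC_SYSTEM},  # cached prefix
                {"role": "user", "content": f"{evidence}\n\nQ: {question}"},
            ],
        )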

Key finding

Distribution shift

Internal G = 0.9956 on the public trial set. Platform Total = 0.480 on the private set: roughly 52 percentage points of gap from distribution shift. The system overfit to the public set's patterns without explicit memorization.

Pipeline architecture

LangGraph DAG with structured fast paths, hybrid retrieval, independent grounding sidecar, and typed answer routing.

[Pipeline diagram] Amber highlights mark the conditional/fast-exit paths: DB Answerer and the Confidence gate both short-circuit to Emit, and the grounding sidecar runs independently of the answer path.
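
A minimal sketch of that fast-exit shape in LangGraph; the node names mirror the diagram, while the state fields and gate predicates are illustrative assumptions.

    # Conditional fast-exit routing in LangGraph (sketch; state fields
    # and predicates are illustrative, node names mirror the diagram).
    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class QAState(TypedDict, total=False):
        question: str
        answer: str
        confident: bool

    def db_answerer(state: QAState) -> QAState:
        # Structured lookup; answers only known question patterns.
        return {"answer": ""}

    def retrieve(state: QAState) -> QAState:
        return {}

    def generate(state: QAState) -> QAState:
        # Typed answer generation; sets the flag the confidence gate reads.
        return {"answer": "...", "confident": True}

    def emit(state: QAState) -> QAState:
        return {}

    g = StateGraph(QAState)
    for name, fn in [("db_answerer", db_answerer), ("retrieve", retrieve),
                     ("generate", generate), ("emit", emit)]:
        g.add_node(name, fn)
    g.set_entry_point("db_answerer")
    # A DB hit short-circuits straight to Emit; otherwise the full path runs.
    g.add_conditional_edges("db_answerer",
        lambda s: "emit" if s.get("answer") else "retrieve",
        {"emit": "emit", "retrieve": "retrieve"})
    g.add_edge("retrieve", "generate")
    # Confidence gate: high confidence emits, low confidence retries retrieval.
    g.add_conditional_edges("generate",
        lambda s: "emit" if s.get("confident") else "retrieve",
        {"emit": "emit", "retrieve": "retrieve"})
    g.add_edge("emit", END)
    app = g.compile()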

Stack

Component-level detail. All of this produced an internal estimate of ~0.92 and a platform result of 0.480.

Component     | Technology             | Detail
Embeddings    | Kanon-2 (1792d)        | Dense vectors, fused with BM25 sparse via Qdrant server-side RRF
Vector store  | Qdrant                 | 4 collections: chunks, pages, segments, bridge_facts
Reranker      | Zerank-2               | TOP_N=12 chunks after retrieval
Generation    | GPT-4.1 / GPT-4.1-mini | GPT-4.1 for free_text; mini for strict types (boolean, number, date, name, names)
Grounding     | LightGBM               | Page-scorer sidecar, scope classifier, evidence portfolio optimizer; 14 modules
Orchestration | LangGraph              | DAG with conditional fast paths, retry gate, confidence scoring
Agent swarm   | Claude Opus 4.6        | 12 agents via Claude Code CLI, Bernstein orchestrator, FastAPI task server
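
The embeddings row corresponds to Qdrant's Query API with server-side fusion: dense and sparse candidate lists are prefetched, then merged with RRF inside the server. A sketch; the named-vector labels and placeholder vectors are assumptions.

    # Hybrid dense + BM25-sparse retrieval fused server-side with RRF
    # (sketch; vector names and placeholder vectors are illustrative).
    from qdrant_client import QdrantClient, models

    client = QdrantClient(url="http://localhost:6333")

    dense_vec = [0.0] * 1792                       # Kanon-2 embedding (1792d)
    sparse_idx, sparse_val = [17, 42], [1.2, 0.7]  # BM25 term ids and weights

    hits = client.query_points(
        collection_name="chunks",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=50),
            models.Prefetch(
                query=models.SparseVector(indices=sparse_idx, values=sparse_val),
                using="bm25",
                limit=50,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # server-side RRF
        limit=12,  # TOP_N=12 then feeds the Zerank-2 reranker
    )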

Agent roster

12 Claude Opus 4.6 instances. 737 tickets in the final 47 hours.

KEREN
DAGAN
SHAI
OREV
EYAL
NOGA
TAMAR
NOAM
KESHET
LIRON
TZUF
GILAD