Platform score: 0.480

Shafi

Evidence-first legal QA over DIFC regulations. Built in 13 days for the Agentic RAG Legal Challenge 2026 (350+ teams). The internal estimate was ~0.92; the platform returned 0.48. That ~44-point gap from distribution shift is the interesting part.

By the numbers

Raw measurements from 13 days of competition. The internal G score was 0.9956 on the public trial set. The platform returned Total = 0.480 on the private set.

1,931 ms
Time to first token (TTFT) after prompt caching
Down from 5,086 ms. Prompt caching was the single largest improvement to any metric. Standard retrieval techniques (BM25, HyDE, RAG Fusion) all failed to move the score.
16
Experiments run
87.5% rejection rate (14 of 16). Only 2 experiments produced measurable improvements. Negative results documented in the article.
172
Python source files
~65,500 lines of code. Up from 49 files on Day 1. Pipeline, retrieval, grounding, ML, generation, and ingestion subsystems.
1,595
Commits in 13 days
March 10-22, 2026. 12 AI agents running concurrently via Claude Code CLI, pulling tasks from a shared FastAPI server (sketched below these figures). 737 tickets processed.
12
AI agents
Claude Opus 4.6 instances coordinated via Bernstein orchestrator. Final 47 hours: 737 tickets across all 12 agents.
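
For concreteness, a minimal sketch of a shared task server of this shape. The endpoints, Ticket model, and in-memory queue are illustrative assumptions, not the actual Bernstein orchestrator API.

    # Minimal shared task server for agent workers (sketch; endpoints,
    # Ticket model, and in-memory queue are illustrative, not the real
    # Bernstein orchestrator API).
    from collections import deque

    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()
    queue: deque = deque()            # pending tickets
    claimed: dict[str, dict] = {}     # ticket_id -> claiming agent + ticket

    class Ticket(BaseModel):
        ticket_id: str
        description: str

    @app.post("/tickets")
    def enqueue(ticket: Ticket):
        queue.append(ticket.model_dump())
        return {"queued": len(queue)}

    @app.post("/tickets/claim")
    def claim(agent: str):
        # Each agent polls this endpoint; tickets are handed out FIFO.
        if not queue:
            raise HTTPException(status_code=404, detail="no pending tickets")
        ticket = queue.popleft()
        claimed[ticket["ticket_id"]] = {"agent": agent, **ticket}
        return ticket

    @app.post("/tickets/{ticket_id}/done")
    def complete(ticket_id: str):
        claimed.pop(ticket_id, None)
        return {"remaining": len(queue)}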

What failed, what worked

16 experiments, 87.5% rejection rate. Every standard RAG retrieval improvement failed. Infrastructure optimization beat all retrieval techniques.

Rejected

BM25 standalone retrieval

Added latency without grounding gain. Failed telemetry gating on the validation slice.
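
For reference, standalone lexical BM25 of the kind gated out here can be sketched with the rank_bm25 library; the corpus and query below are illustrative.

    # Standalone BM25 retrieval (sketch; corpus and query illustrative).
    from rank_bm25 import BM25Okapi

    corpus = [
        "A DIFC entity must maintain a registered office in the DIFC.",
        "The Registrar may strike a company off the register.",
        "Employees accrue vacation leave at a minimum statutory rate.",
    ]
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    query = "minimum vacation leave entitlement".lower().split()
    top_docs = bm25.get_top_n(query, corpus, n=2)  # ranked by BM25 score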

Rejected

RAG Fusion

Multiple query reformulations. Increased latency, no measurable improvement to answer quality or grounding score.
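
RAG Fusion issues several reformulations of the question, retrieves for each, and merges the rankings, typically with reciprocal rank fusion (RRF). A minimal sketch of the merge step; k=60 is the conventional RRF constant.

    # Reciprocal rank fusion over rankings from multiple query
    # reformulations (sketch; document ids illustrative).
    from collections import defaultdict

    def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Each reformulated query yields its own ranking; RRF merges them.
    merged = rrf_merge([["d3", "d1", "d7"], ["d1", "d3", "d9"]])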

Rejected

HyDE (Hypothetical Document Embeddings)

Generated hypothetical answers to improve retrieval. Caused answer quality regression on the validation set.
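
HyDE retrieves with the embedding of a generated hypothetical answer instead of the raw question. A sketch assuming an OpenAI-style client and a stand-in vector_search function; the model names are illustrative (the real pipeline embeds with Kanon-2).

    # HyDE: embed a hypothetical answer, not the question (sketch;
    # `client` is an OpenAI-style client, `vector_search` a stand-in).
    def hyde_retrieve(question: str, client, vector_search):
        draft = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                "content": f"Write a short passage answering: {question}",
            }],
        ).choices[0].message.content
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=draft
        ).data[0].embedding
        return vector_search(emb)  # retrieve by the hypothetical answer's vector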

Rejected

Step-back prompting

Abstracting to a broader question before retrieval. No score improvement, added an extra LLM call per question.
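
Step-back prompting first abstracts the question, then retrieves for both forms; the abstraction is the extra LLM call noted above. A sketch with stand-in client and retrieve helpers.

    # Step-back prompting (sketch; `client` and `retrieve` are stand-ins).
    def step_back_retrieve(question: str, client, retrieve):
        broader = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                "content": f"Rephrase as a more general question: {question}",
            }],
        ).choices[0].message.content  # the extra per-question LLM call
        return retrieve(question) + retrieve(broader)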

Worked

Prompt caching

TTFT dropped from 5,086 ms to 1,931 ms. The single largest improvement to any metric across all 16 experiments. Infrastructure beat sophistication.
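
The win comes from prompt structure rather than model changes. Assuming OpenAI-style automatic prefix caching (the generation models here are GPT-4.1 family), a cache only hits when the leading tokens are identical across requests, so the large static instructions go first and the per-question text last. A sketch with illustrative prompt content; on OpenAI's API, hits show up as cached prompt tokens in the response usage.

    # Order prompts so the long static prefix is cache-friendly (sketch;
    # prompt content illustrative). OpenAI-style prefix caching is
    # automatic when leading tokens match across requests.
    STATIC_SYSTEM = (
        "You answer questions about DIFC regulations strictly from the "
        "provided evidence..."  # long and identical for every request
    )

    def answer(client, question: str, evidence: str):
        return client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": STATIC_SYSTEM},  # cached prefix
                {"role": "user", "content": f"{evidence}\n\nQ: {question}"},
            ],
        )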

Key finding

Distribution shift

Internal G = 0.9956 on the public trial set. Platform Total = 0.480 on the private set: roughly 52 percentage points of gap from distribution shift. The system overfit to the public set's patterns without explicit memorization.

Pipeline architecture

LangGraph DAG with structured fast paths, hybrid retrieval, independent grounding sidecar, and typed answer routing.

[Pipeline diagram] Amber highlights mark the conditional/fast-exit paths: DB Answerer and the Confidence gate both short-circuit to Emit, and the grounding sidecar runs independently of the answer path.
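
A minimal sketch of that fast-exit shape in LangGraph; the node names mirror the diagram, while the state fields and gate predicates are illustrative assumptions.

    # Conditional fast-exit routing in LangGraph (sketch; state fields
    # and predicates are illustrative, node names mirror the diagram).
    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class QAState(TypedDict, total=False):
        question: str
        answer: str
        confident: bool

    def db_answerer(state: QAState) -> QAState:
        # Structured lookup; answers only known question patterns.
        return {"answer": ""}

    def retrieve(state: QAState) -> QAState:
        return {}

    def generate(state: QAState) -> QAState:
        # Typed answer generation; sets the flag the confidence gate reads.
        return {"answer": "...", "confident": True}

    def emit(state: QAState) -> QAState:
        return {}

    g = StateGraph(QAState)
    for name, fn in [("db_answerer", db_answerer), ("retrieve", retrieve),
                     ("generate", generate), ("emit", emit)]:
        g.add_node(name, fn)
    g.set_entry_point("db_answerer")
    # A DB hit short-circuits straight to Emit; otherwise the full path runs.
    g.add_conditional_edges("db_answerer",
        lambda s: "emit" if s.get("answer") else "retrieve",
        {"emit": "emit", "retrieve": "retrieve"})
    g.add_edge("retrieve", "generate")
    # Confidence gate: high confidence emits, low confidence retries retrieval.
    g.add_conditional_edges("generate",
        lambda s: "emit" if s.get("confident") else "retrieve",
        {"emit": "emit", "retrieve": "retrieve"})
    g.add_edge("emit", END)
    app = g.compile()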

Stack

Component-level detail. All of this produced an internal estimate of ~0.92 and a platform result of 0.480.

Component     | Technology             | Detail
Embeddings    | Kanon-2 (1792d)        | Dense vectors, fused with BM25 sparse via Qdrant server-side RRF
Vector store  | Qdrant                 | 4 collections: chunks, pages, segments, bridge_facts
Reranker      | Zerank-2               | TOP_N=12 chunks after retrieval
Generation    | GPT-4.1 / GPT-4.1-mini | GPT-4.1 for free_text; mini for strict types (boolean, number, date, name, names)
Grounding     | LightGBM               | Page-scorer sidecar, scope classifier, evidence portfolio optimizer; 14 modules
Orchestration | LangGraph              | DAG with conditional fast paths, retry gate, confidence scoring
Agent swarm   | Claude Opus 4.6        | 12 agents via Claude Code CLI, Bernstein orchestrator, FastAPI task server
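
The embeddings row corresponds to Qdrant's Query API with server-side fusion: dense and sparse candidate lists are prefetched, then merged with RRF inside the server. A sketch; the named-vector labels and placeholder vectors are assumptions.

    # Hybrid dense + BM25-sparse retrieval fused server-side with RRF
    # (sketch; vector names and placeholder vectors are illustrative).
    from qdrant_client import QdrantClient, models

    client = QdrantClient(url="http://localhost:6333")

    dense_vec = [0.0] * 1792                       # Kanon-2 embedding (1792d)
    sparse_idx, sparse_val = [17, 42], [1.2, 0.7]  # BM25 term ids and weights

    hits = client.query_points(
        collection_name="chunks",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=50),
            models.Prefetch(
                query=models.SparseVector(indices=sparse_idx, values=sparse_val),
                using="bm25",
                limit=50,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # server-side RRF
        limit=12,  # TOP_N=12 then feeds the Zerank-2 reranker
    )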

Agent roster

12 Claude Opus 4.6 instances. 737 tickets in the final 47 hours.

KEREN
DAGAN
SHAI
OREV
EYAL
NOGA
TAMAR
NOAM
KESHET
LIRON
TZUF
GILAD