Shafi
Evidence-first legal QA over DIFC regulations. Built in 13 days for the Agentic RAG Legal Challenge 2026 (350+ teams). The internal estimate was ~0.92; the platform returned 0.48. That 44-point gap from distribution shift is the interesting part.
By the numbers
Raw measurements from 13 days of competition. The internal G score was 0.9956 on the public trial set. The platform returned Total = 0.480 on the private set.
What failed, what worked
16 experiments, 14 rejected (an 87.5% rejection rate). Every standard RAG retrieval improvement failed; infrastructure optimization beat every retrieval technique.
BM25 standalone retrieval
Added latency without grounding gain. Failed telemetry gating on the validation slice.
RAG Fusion
Multiple query reformulations. Increased latency, no measurable improvement to answer quality or grounding score.
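RAG Fusion issues several reformulations of the question and merges their ranked result lists, typically with reciprocal rank fusion (RRF). A minimal sketch of the merge step, assuming each reformulation has already returned an ordered list of document IDs (the `reg_*` IDs and `k=60` constant are illustrative, not Shafi's actual data):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical reformulations of one question, each with its own ranking
lists = [
    ["reg_12", "reg_07", "reg_33"],
    ["reg_07", "reg_12", "reg_90"],
    ["reg_07", "reg_33", "reg_12"],
]
fused = rrf_merge(lists)  # reg_07 wins: ranked high in all three lists
```

The fused list is what would then feed the reranker; here the cost shows up as one retrieval round-trip per reformulation, which is the added latency the experiment measured.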
HyDE (Hypothetical Document Embeddings)
Generated hypothetical answers to improve retrieval. Caused answer quality regression on the validation set.
Step-back prompting
Abstracting to a broader question before retrieval. No score improvement, added an extra LLM call per question.
Prompt caching
TTFT (time to first token) dropped from 5,086 ms to 1,931 ms, a 62% reduction and the single largest improvement to any metric across all 16 experiments. Infrastructure beat sophistication.
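Provider-side prompt caching only hits when requests share a long, byte-identical prefix, so the win comes from message ordering: static content (system prompt, few-shot examples) first, per-question content last. A minimal sketch of that structure; the prompt text and few-shot pair are hypothetical placeholders, not Shafi's actual prompts:

```python
# Identical on every call -> cacheable prefix
STATIC_SYSTEM = "You answer questions about DIFC regulations, citing evidence."
FEW_SHOT = [
    {"role": "user", "content": "Is X required? (illustrative example)"},
    {"role": "assistant", "content": "Yes, per Article N. (illustrative example)"},
]

def build_messages(question, retrieved_chunks):
    """Static prefix first so the provider's prompt cache can hit;
    only the tail (evidence + question) varies per request."""
    return (
        [{"role": "system", "content": STATIC_SYSTEM}]
        + FEW_SHOT
        + [{"role": "user",
            "content": "Evidence:\n" + "\n".join(retrieved_chunks)
                       + "\n\nQuestion: " + question}]
    )
```

Putting retrieved evidence in the final message rather than the system prompt is what keeps the prefix stable across questions.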
Distribution shift
Internal G = 0.9956 on the public trial set; platform Total = 0.480 on the private set, a gap of roughly 52 percentage points from distribution shift. The system overfit to the public set's patterns without explicit memorization.
Pipeline architecture
LangGraph DAG with structured fast paths, hybrid retrieval, independent grounding sidecar, and typed answer routing.
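The conditional fast-path idea can be sketched in plain Python without the langgraph dependency; the node functions, routing condition, and state keys below are hypothetical stand-ins for Shafi's actual graph:

```python
def classify(state):
    """Hypothetical fast-path condition: strictly typed questions skip grounding."""
    state["route"] = "fast" if state["answer_type"] in {"boolean", "number", "date"} else "full"
    return state

def retrieve(state):
    state["chunks"] = ["<retrieved chunk>"]   # stand-in for hybrid retrieval
    return state

def ground(state):
    state["grounded"] = True                  # stand-in for the grounding sidecar
    return state

def generate(state):
    state["answer"] = f"answer:{state['answer_type']}"
    return state

def run(state):
    """Conditional DAG: classify -> retrieve -> (generate | ground -> generate)."""
    state = classify(state)
    state = retrieve(state)
    if state["route"] == "full":              # conditional edge
        state = ground(state)
    return generate(state)
```

In LangGraph proper the `if` becomes a conditional edge on the compiled graph, which is what makes the fast path explicit and inspectable rather than buried in node code.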
Stack
Component-level detail. All of this produced an internal estimate of ~0.92 and a platform result of 0.480.
| Component | Technology | Detail |
|---|---|---|
| Embeddings | Kanon-2 1792d | Dense vectors, fused with BM25 sparse via Qdrant server-side RRF |
| Vector store | Qdrant | 4 collections: chunks, pages, segments, bridge_facts |
| Reranker | Zerank-2 | TOP_N=12 chunks after retrieval |
| Generation | GPT-4.1 / GPT-4.1-mini | 4.1 for free_text, mini for strict types (boolean, number, date, name, names) |
| Grounding | LightGBM | Page scorer sidecar, scope classifier, evidence portfolio optimizer, 14 modules |
| Orchestration | LangGraph | DAG with conditional fast paths, retry gate, confidence scoring |
| Agent swarm | Claude Opus 4.6 | 12 agents via Claude Code CLI, Bernstein orchestrator, FastAPI task server |
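The generation row above routes by answer type. A minimal sketch of that routing; the model names match the table, but the function and constant names are hypothetical:

```python
# Strict types have short, verifiable answers -> cheaper model suffices
STRICT_TYPES = {"boolean", "number", "date", "name", "names"}

def pick_model(answer_type: str) -> str:
    """Route strict answer types to GPT-4.1-mini, open-ended
    free_text to GPT-4.1, per the stack table."""
    return "gpt-4.1-mini" if answer_type in STRICT_TYPES else "gpt-4.1"
```

The design choice here is cost and latency: strict types are easy to validate downstream, so only free_text answers pay for the larger model.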
Agent roster
12 Claude Opus 4.6 instances. 737 tickets in the final 47 hours.
Links
Everything is open source under AGPL-3.0.
Article
Full write-up on the competition, architecture decisions, what worked, what didn't, and the distribution shift story.
Shafi (pipeline)
The RAG pipeline. 172 Python files, ~65,500 LOC. Retrieval, generation, grounding, ingestion.
Bernstein (orchestrator)
The multi-agent orchestrator. Task server, agent coordination, ticket management for the Claude swarm.