The Reliability Design Checklist
D1 Components (Orchestration)
D1_RELIABILITY_CHECKLIST = {
"loop_termination": {
"requirement": "Loop terminates on stop_reason, not text parsing",
"test": "Verify stop_reason == 'end_turn' triggers return",
"failure_mode": "Infinite loop or premature termination"
},
"tool_use_handling": {
"requirement": "All tool_use blocks processed, results batched in one turn",
"test": "Verify multi-tool responses handled correctly",
"failure_mode": "Missing tool calls, invalid conversation history"
},
"error_recovery": {
"requirement": "Coordinator makes informed retry/fallback decisions",
"test": "Simulate transient and permanent failures, verify different behaviors",
"failure_mode": "Retrying permanent failures, not retrying transient failures"
},
"human_handoff": {
"requirement": "Clear escalation for scope changes, irreversible actions, low confidence",
"test": "Verify escalation triggers fire with complete context",
"failure_mode": "Agent exceeds scope, or reviewer lacks decision context"
},
"state_persistence": {
"requirement": "Checkpoint after each significant operation",
"test": "Kill agent mid-run, verify it resumes correctly",
"failure_mode": "Lost progress on failure, duplicate work on resume"
}
}
D5 Components (Reliability Assurance)
D5_RELIABILITY_CHECKLIST = {
"structured_errors": {
"requirement": "All tool failures return {error_category, is_retryable, message}",
"test": "Simulate each error type, verify structured response",
"failure_mode": "Coordinator makes uninformed recovery decisions"
},
"confidence_scoring": {
"requirement": "Extractions include field-level confidence",
"test": "Verify confidence routes to appropriate tier",
"failure_mode": "Low-confidence results auto-processed, errors reach customers"
},
"calibration": {
"requirement": "Stratified sampling validates confidence accuracy",
"test": "Verify sample results match expected confidence accuracy",
"failure_mode": "Miscalibrated confidence — high-confidence results are wrong"
},
"partial_failure_handling": {
"requirement": "Synthesis knows which sources failed and why",
"test": "Simulate partial subagent failure, verify synthesis labels gaps",
"failure_mode": "Synthesis presents partial results as complete"
},
"context_management": {
"requirement": "Context window managed, tool results trimmed",
"test": "Run long session, verify context doesn't overflow",
"failure_mode": "Quality degrades as session length increases"
}
}
When Both Work Together
# D1: Coordinator spawns 5 subagents, handles their results
# D5: Subagents return structured errors, synthesis labels gaps
# What you get:
{
"status": "partial_success",
"results": {
"completed_tasks": 4,
"failed_tasks": 1,
"failure_detail": "EU regulatory access denied (permission error, permanent)",
},
"synthesis": "Research complete for 4 of 5 areas...",
"confidence": "medium", # Reduced due to known gap
"data_gaps": ["EU regulatory landscape"],
"recommended_action": "human_review", # Due to medium confidence + known gap
}
When D1 Works But D5 Doesn’t
# D1: Correct orchestration — loop runs correctly, coordinator coordinates well
# D5: No structured errors — silent failures, no confidence scoring
# What you get:
{
"status": "success", # Looks like success — actually partial failure silently
"synthesis": "Research complete across all 5 areas...", # EU data is fabricated
"confidence": "high", # Uncalibrated — confidence not measured
# No indication of failure or gaps
}
# Result: incorrect data presented with false confidence
Key Takeaways
- D1 orchestrates, D5 validates — both required for production reliability
- Without D5 structured errors: D1 coordinator makes uninformed decisions
- Without D1 loop correctness: D5 error detection has nothing to act on
- The combination: informed coordination + quality assurance + calibrated confidence
- Production checklist: verify both D1 and D5 components before shipping