Chain Reliability — Validation Gates Between Steps

Core 6 min +25 XP

💡

THE ANALOGY

Quality gates on an assembly line. Finding a defect at station 3 costs 10 minutes. Finding the same defect at station 8 costs 10 minutes plus all the downstream work since station 3. Validate at each step — catching errors early is always cheaper.

⚠️ EXAM TRAP — The Wrong Answer People Choose

Validating only at the end of the chain. A bad step 2 output that passes undetected causes step 3 to produce a confidently wrong result — compounding the error.

KEY POINTS

1 Validate step output BEFORE passing to next step — format check AND quality check.

2 Lightweight evaluator Claude call between heavy steps: cheap quality assessment before expensive next step.

3 Retry the failing step before propagating failure — transient issues resolve on retry.

4 Low confidence outputs route to human review rather than continuing the chain.

5 Quality gates catch semantic errors that schema validation misses.

Validation Gate Pattern

async def validated_chain(document: str) -> dict:

    # Step 1: Extract
    extraction = await extract_clauses(document)

    # ── GATE 1: Schema validation ─────────────────────────
    schema_ok = validate_schema(extraction, required=["liability", "termination", "parties"])
    if not schema_ok["valid"]:
        # Retry with specific errors
        extraction = await extract_clauses(document, retry_errors=schema_ok["errors"])
        if not validate_schema(extraction, required=["liability", "termination", "parties"])["valid"]:
            return {"status": "needs_review", "failed_at": "extraction_schema"}

    # ── GATE 2: Lightweight quality evaluator ────────────
    quality = await call_claude(
        model="claude-haiku-4-5-20251001",  # cheap model
        system="Rate extraction quality: high | medium | low. Return only the word.",
        user=json.dumps(extraction)
    )
    if quality.strip() == "low":
        return {"status": "needs_review", "failed_at": "extraction_quality", "output": extraction}

    # ── GATE 3: Confidence check ─────────────────────────
    conf = extraction.get("confidence", {})
    if conf.get("overall") == "low":
        return {"status": "needs_review", "failed_at": "extraction_confidence", "output": extraction}

    # All gates passed — proceed to step 2
    risk = await score_risk(extraction)
    return {"status": "success", "extraction": extraction, "risk": risk}

The Lightweight Evaluator

# Cost: ~$0.001 per evaluation on Haiku
# Value: prevents running expensive next step on bad input

async def assess_quality(step_output: dict, context: str) -> str:
    """Quick quality assessment before expensive next step."""
    return (await call_claude(
        model="claude-haiku-4-5-20251001",
        system=f"""You are evaluating output quality for: {context}
Return ONLY one word: high | medium | low

high: complete, specific, all required fields present
medium: mostly complete, minor gaps
low: vague, incomplete, contradictory, or missing critical fields""",
        user=json.dumps(step_output)
    )).strip().lower()

Key Takeaways

Three gates per step: schema → quality → confidence
Lightweight evaluator (Haiku) before heavy next step (Sonnet)
Retry at step level — transient failures before propagating
Low confidence on critical fields → human review, not next step
Early validation is always cheaper than late detection

id: “d4-t5-2-retry-patterns” title: “Retry Patterns — Effective Error Recovery” domain: “d4” taskRef: “T4.5” order: 14 xp: 25 tag: “Core” duration: “6 min” analogy: “A proofreader marking up a draft with specific corrections. Not ‘this is wrong’ — but red marks on each specific error with notes on what’s correct. The author addresses each mark precisely. Claude’s retry needs the same specificity.” examTrap: “Retrying with the original prompt unchanged — produces the same wrong output. The retry prompt must include all three: original document, previous failed output, and specific errors.” keyPoints:

“Retry prompt three required parts: original document + failed output + specific errors.”
“Without failed output: Claude doesn’t know what to change.”
“Without specific errors: Claude doesn’t know what’s wrong.”
“Error format: field name + current value + expected value/format.”
“Never retry when source document lacks the information — produces fabrication.” antiPatterns:
“Retry with only original prompt — identical wrong output”
“Vague error messages: ‘date invalid’ vs ‘invoice_date must be YYYY-MM-DD, got Jan 15 2024’”
“Retrying source-document-missing errors — fabrication results”
“Unlimited retries — cap at 2-3 before human review” tbChallenge: “Write the complete retry prompt for this failure: extractor returned invoice_date=‘January 15, 2024’ and total_amount=‘$4,999.00’. Show all three components.”

The Three-Component Retry Prompt

def build_retry_prompt(document: str, failed_output: dict, errors: list) -> str:
    error_text = "\n".join([
        f"• {e['field']}: {e['error']}"
        + (f"\n  Got: {e['got']}" if 'got' in e else "")
        + (f"\n  Expected: {e['expected']}" if 'expected' in e else "")
        for e in errors
    ])

    return f"""Your previous extraction had validation errors. Fix the specific issues below.

ORIGINAL DOCUMENT:
{document}

YOUR PREVIOUS (INCORRECT) EXTRACTION:
{json.dumps(failed_output, indent=2)}

ERRORS TO FIX:
{error_text}

Fix only these errors. Do not change correct fields.
If a value is not in the document, return null — do not invent values."""

# Example errors for the teach-back scenario:
errors = [
    {
        "field":    "invoice_date",
        "error":    "Must be ISO format YYYY-MM-DD",
        "got":      "January 15, 2024",
        "expected": "2024-01-15"
    },
    {
        "field":    "total_amount",
        "error":    "Must be a decimal number in dollars, not a string with currency symbol",
        "got":      "$4,999.00",
        "expected": 4999.00
    }
]

Retry Decision

NOT_RETRYABLE = ["not found in source", "not in the document", "absent from", "document does not contain"]

def should_retry(error: dict) -> bool:
    return not any(p in error.get("error", "").lower() for p in NOT_RETRYABLE)

async def extract_with_retry(document: str, max_retries: int = 2) -> dict:
    last_output, last_errors = None, None
    for attempt in range(max_retries + 1):
        output = await extract(document, last_output, last_errors)
        validation = validate(output, document)
        if validation["valid"]:
            return {"status": "success", "output": output}
        retryable = [e for e in validation["errors"] if should_retry(e)]
        if retryable and attempt < max_retries:
            last_output, last_errors = output, retryable
        else:
            return {"status": "needs_review", "output": output}
    return {"status": "needs_review", "output": last_output}

Key Takeaways

Three components: document + failed output + specific errors — all required
Field + got + expected in every error
Never retry missing-data errors — fabrication
Cap retries at 2-3 — escalate after
“Do not change correct fields” — targeted correction

id: “d4-t5-3-batch-processing” title: “Batch Processing — Message Batches API” domain: “d4” taskRef: “T4.5” order: 15 xp: 30 tag: ”⚡ Exam Tested” duration: “7 min” analogy: “Overnight cloud processing vs same-day in-store printing. Overnight is 50% cheaper and handles thousands at once — but you get results the next morning. The Batch API is the overnight option: cheaper, scalable, not for anything urgent.” examTrap: “Using Batch API for anything with a latency requirement. No guaranteed SLA — up to 24 hours. Using it for CI/CD checks, user-blocking workflows, or real-time features is an architectural error. The exam specifically tests this distinction.” keyPoints:

“Batch API: 50% cost reduction. Up to 24-hour processing. NO latency SLA.”
“Use for: nightly batch jobs, non-urgent large volumes, cost-sensitive backfills.”
“Never use for: CI/CD pipeline checks, user-blocking features, real-time needs.”
“custom_id field correlates your requests with responses — required for partial failure recovery.”
“Partial failure: resubmit ONLY failed requests by custom_id — not the whole batch.” antiPatterns:
“Batch API for pre-merge CI/CD — pipeline may wait 24 hours”
“No custom_id — can’t identify which requests failed”
“Resubmitting entire batch on partial failure — wastes the 50% saving”
“Assuming synchronous completion — must poll for status” tbChallenge: “You have 10,000 invoices to process. Some will fail schema validation after extraction. Design the complete Batch API strategy: structure, polling, partial failure handling.”

Batch vs Real-Time Decision

def choose_api(request: dict) -> str:
    if request.get("user_is_waiting"):         return "real_time"
    if request.get("is_ci_cd_check"):          return "real_time"
    if request.get("latency_sla_minutes"):     return "real_time"
    if request.get("non_urgent_batch_job"):    return "batch_api"
    if request.get("volume", 0) > 5000:        return "consider_batch"
    return "real_time"  # default to real-time when uncertain

Complete Batch Workflow

import anthropic, time

client = anthropic.Anthropic()

# 1. Build requests with custom_ids
requests = [
    {
        "custom_id": f"inv-{inv_id}",  # YOUR correlation key
        "params": {
            "model":      "claude-sonnet-4-6",
            "max_tokens": 2048,
            "tools":       [invoice_extraction_tool],
            "tool_choice": {"type": "tool", "name": "extract_invoice"},
            "messages": [{"role": "user", "content": f"Extract: {text}"}]
        }
    }
    for inv_id, text in invoices.items()
]

# 2. Submit
batch = client.beta.messages.batches.create(requests=requests)
batch_id = batch.id

# 3. Poll
while True:
    status = client.beta.messages.batches.retrieve(batch_id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# 4. Process results
succeeded, failed = [], []
for result in client.beta.messages.batches.results(batch_id):
    inv_id = result.custom_id.replace("inv-", "")
    if result.result.type == "succeeded":
        succeeded.append({"id": inv_id, "data": parse_tool_result(result.result.message)})
    else:
        failed.append({"id": inv_id, "custom_id": result.custom_id})

# 5. Resubmit ONLY failed (not the whole batch)
if failed:
    failed_ids = {f["custom_id"] for f in failed}
    retry = [r for r in requests if r["custom_id"] in failed_ids]
    client.beta.messages.batches.create(requests=retry)

Key Takeaways

50% cheaper, up to 24h — non-urgent work only
Never for latency-sensitive work — CI/CD, user-blocking
custom_id is non-negotiable — partial failure recovery
Resubmit only failed requests — don’t resubmit successes
Poll for completion — asynchronous, not synchronous

id: “d4-t6-1-review-arch” title: “Multi-Instance Review — Why Self-Review Fails” domain: “d4” taskRef: “T4.6” order: 16 xp: 35 tag: ”⚡ Exam Tested” duration: “8 min” analogy: “Proofreading your own writing. Your brain autocorrects as you read — you see what you intended, not what’s there. A fresh reader has no such bias. Claude reviewing its own code has the same problem: it retains the reasoning context from generation.” examTrap: “Thinking ‘ask Claude to review carefully’ equals independent review. It does not. The model retains its generation reasoning regardless of how you phrase the review request. The fix is a completely separate API call with no shared context.” keyPoints:

“Self-review fails: Claude retains reasoning context from generation — biased toward validating its own decisions.”
“Independent review: fresh API call with NO knowledge of how the code was generated.”
“Tell the reviewer explicitly: ‘Do not assume this code is correct. Find errors.’”
“Multi-pass: per-file reviews (parallel) → cross-file integration (sequential).”
“Confidence self-reporting alongside findings enables routing to human review.” antiPatterns:
“Same session for generation and review — reviewer inherits generator’s reasoning”
“Passing generation conversation history to reviewer — same effect”
“One massive review of all files — attention dilution”
“Not telling reviewer to assume code might be wrong” tbChallenge: “A colleague says: ‘I just ask Claude to double-check its code before submitting. Works fine.’ What’s actually happening in that session, and why does it produce worse results than an independent instance?”

Why Self-Review Fails

# Generation session builds reasoning context:
# "I used a for loop here because X"
# "I skipped error handling because Y"
# "This pattern is correct because Z"

# ❌ Self-review in same session
generation_messages = [
    {"role": "user", "content": "Implement auth middleware"},
    {"role": "assistant", "content": "Here's the first draft..."},
    # ... multiple turns, Claude builds justification for each decision ...
    {"role": "assistant", "content": final_code},
]

review_in_same_session = generation_messages + [
    {"role": "user", "content": "Review your code for bugs"}
    # Claude's reasoning: "I know WHY I made each decision, they were correct"
    # Result: confirms its own decisions rather than questioning them
]

The Independent Review Call

# ✅ Separate API call — zero generation context

async def independent_review(code: str, task: str) -> dict:
    return await call_claude(
        messages=[{
            "role": "user",
            "content": f"""Review this code for bugs and security issues.

Task it implements:
{task}

Code:
{code}

CRITICAL: Do NOT assume this code is correct.
Your job is to find bugs, vulnerabilities, and logic errors — not confirm correctness.

Return JSON: {{findings: [{{severity, description, line, fix}}], verdict: approved|needs_changes}}"""
        }]
        # No generation messages — fresh context, no inherited reasoning
    )

Multi-Pass Architecture

async def full_pr_review(files: list[str]) -> dict:
    # Pass 1: per-file, parallel, full attention per file
    per_file = await asyncio.gather(*[
        independent_review(read_file(f), f"File: {f}")
        for f in files
    ])

    # Pass 2: cross-file integration, sequential
    integration = await call_claude(
        messages=[{
            "role": "user",
            "content": f"""Find cross-file integration issues ONLY.
Per-file reviewers already checked individual correctness.
Focus on: interface mismatches, shared state bugs, contradictions between files.

Per-file findings:
{json.dumps(per_file)}"""
        }]
    )

    return {"per_file": per_file, "integration": integration}

Key Takeaways

Self-review is biased — generation context persists in session
Independent review = separate API call, no generation history
Tell reviewer to assume code might be wrong — prevents confirmation
Per-file parallel, cross-file sequential — multi-pass
Confidence alongside findings enables human review routing

id: “d4-t6-2-multi-pass” title: “Multi-Pass Review — Preventing Attention Dilution” domain: “d4” taskRef: “T4.6” order: 17 xp: 25 tag: “Core” duration: “6 min” analogy: “Medical specialists vs one generalist reviewing 20 lab results. The cardiologist reviews cardiac results with full attention. The neurologist reviews neurological results with full attention. A generalist reviewing all 20 simultaneously misses interaction effects.” examTrap: “Sending all changed files to one Claude call. Attention dilution: with 20 files in one context, each file gets 1/20th of effective attention — issues that per-file review catches get missed.” keyPoints:

“Attention dilution: more files in one context = less effective attention per file.”
“Multi-pass: per-file reviews run in parallel (full attention), cross-file synthesis runs sequentially.”
“Cross-file pass receives structured per-file summaries — not raw file contents.”
“Contradictions between per-file findings surfaced in cross-file pass for human resolution.”
“Per-file context: one file = full context window. Fifteen files = ~1/15th effective attention.” antiPatterns:
“All files in one context — attention dilution”
“No cross-file pass — misses integration issues”
“Cross-file pass receiving raw files — re-introduces dilution at synthesis stage”
“Contradictions left unresolved — developers get conflicting guidance” tbChallenge: “Per-file review of auth.ts says getUser() returns null on error. Per-file review of orders.ts says getUser() throws UserNotFoundException. How does your cross-file pass detect this and what does it output?”

The Numbers

15 files, all-in-one review call:
  Total context: ~150k tokens
  Effective attention per file: ~10k tokens
  File 13 competes with context from files 1-12

15 files, per-file reviews (parallel):
  Context per call: full context window
  Effective attention per file: full
  File 13 reviewed with complete focus

Multi-Pass Implementation

async def multi_pass_review(files: list[str]) -> dict:

    # Pass 1: per-file, parallel, full context window each
    per_file = await asyncio.gather(*[
        call_claude(messages=[{
            "role": "user",
            "content": f"""Review {f} thoroughly. Focus ONLY on this file.

{read_file(f)}

Return JSON: {{
  "file": "{f}",
  "issues": [...],
  "interfaces": ["exported functions/types with their signatures"],
  "assumptions": ["what this file assumes about other code"]
}}"""
        }])
        for f in files
    ])

    # Pass 2: cross-file, sequential, receives summaries
    cross_file = await call_claude(messages=[{
        "role": "user",
        "content": f"""Find cross-file integration issues ONLY.
Per-file reviewers checked individual correctness.
Focus on: interface mismatches, contradictory assumptions, shared state bugs.

Per-file summaries:
{json.dumps(per_file)}

Return: {{
  "contradictions": [
    {{"function": "...", "file_a": "...", "says": "...", "file_b": "...", "says": "..."}}
  ],
  "integration_issues": [...]
}}"""
    }])

    contradictions = parse_contradictions(cross_file)
    return {
        "per_file": per_file,
        "integration": cross_file,
        "needs_human_resolution": contradictions
    }

Contradiction Example Output

{
  "contradictions": [
    {
      "function": "getUser()",
      "file_a": "auth.ts",
      "file_a_says": "returns null on error",
      "file_b": "orders.ts",
      "file_b_says": "throws UserNotFoundException on error",
      "human_question": "Which behavior is correct? The other file's call site must be updated."
    }
  ]
}

Key Takeaways

Attention dilution is real — per-file reviews catch more issues
Parallel per-file, sequential cross-file — the pattern
Cross-file receives summaries, not raw files
Contradictions surfaced — route to human for resolution
Two passes minimum — individual correctness + integration

id: “d4-t6-3-confidence-scoring” title: “Confidence Scoring — Routing Based on Model Certainty” domain: “d4” taskRef: “T4.6” order: 18 xp: 30 tag: “Core” duration: “7 min” analogy: “A doctor’s referral system. Confident diagnosis → prescribe and discharge. Uncertain → specialist referral. Alarmed → emergency. Routing based on confidence level means the right resource handles the right cases — not every case going to the most expensive resource.” examTrap: “Treating confidence as binary accept/reject. The exam tests that confidence scoring enables tiered ROUTING — different confidence levels trigger different workflows, not just yes/no.” keyPoints:

“Three routing tiers: high confidence → auto-process, medium → spot-check, low → human review.”
“Field-level confidence: individual fields in one extraction can route differently.”
“Critical fields always escalate to human review when low confidence — regardless of overall score.”
“Calibration: measure whether self-reported confidence correlates with actual accuracy.”
“Recalibrate when miscalibration detected — ‘high’ that’s only 72% accurate is actually ‘medium’.” antiPatterns:
“Binary confidence — confident or not confident loses routing granularity”
“No routing logic — confidence score collected but never used”
“Uncalibrated confidence — assuming self-reports are accurate without measuring”
“Same human review priority for all low-confidence — critical fields need faster escalation” tbChallenge: “Extraction reports: vendor_name (high), invoice_date (high), total_amount (low). What does routing do? How is this different from all-fields-low? Why does field-level confidence matter more than just overall?”

Field-Level Confidence Schema

{
    "vendor_name":  "Acme Corp",
    "invoice_date": "2024-01-15",
    "total_amount": 4999.00,
    "confidence": {
        "overall": "medium",
        "low_confidence_fields": [
            {
                "field":  "total_amount",
                "reason": "Multiple amount figures on invoice; unclear which is final total"
            }
        ]
    }
}

Routing Logic

CRITICAL_FIELDS = {"total_amount", "invoice_number", "vendor_name", "due_date"}

def route(extraction: dict) -> dict:
    conf = extraction.get("confidence", {})
    overall = conf.get("overall", "low")
    low_fields = conf.get("low_confidence_fields", [])
    low_critical = [f for f in low_fields if f["field"] in CRITICAL_FIELDS]

    # High confidence, no uncertain fields
    if overall == "high" and not low_fields:
        return {"route": "auto_process"}

    # Critical field uncertain — highest priority human review
    if low_critical:
        return {
            "route": "human_review",
            "priority": "high",
            "focus_fields": [f["field"] for f in low_critical],
            "reason": "Critical financial field has low confidence"
        }

    # Overall low or too many uncertain fields
    if overall == "low" or len(low_fields) > 2:
        return {"route": "human_review", "priority": "standard"}

    # Medium overall, few non-critical uncertain fields
    return {
        "route": "spot_check",
        "check_fields": [f["field"] for f in low_fields]
    }

# For the teach-back scenario:
# vendor_name (high), invoice_date (high), total_amount (low + critical)
# → human_review, priority=high, focus=["total_amount"]
# NOT the same as all-low: that routes to standard human review
# Critical field uncertainty always escalates to high-priority

Calibration

def calibrate(sample: list[dict]) -> dict:
    tiers = {"high": [], "medium": [], "low": []}
    for r in sample:
        tiers[r["confidence"]["overall"]].append(r["human_verified_correct"])

    result = {}
    for tier, vals in tiers.items():
        if vals:
            acc = sum(vals) / len(vals)
            result[tier] = {
                "accuracy": f"{acc:.0%}",
                "well_calibrated": (acc > 0.90 if tier == "high"
                    else 0.70 < acc <= 0.90 if tier == "medium"
                    else acc <= 0.70)
            }
    return result

# If "high" confidence is only 72% accurate:
# → Claude's "high" is actually "medium"
# → Adjust routing: treat "high" as "medium" until recalibrated

Key Takeaways

Confidence enables routing — three tiers, three actions
Field-level confidence — individual fields can route differently
Critical fields always escalate when uncertain
Calibrate — measure self-reported confidence against actual accuracy
Recalibrate thresholds when miscalibration detected