Claude Code Systems · April 2026

How I Taught Claude Code to Actually Research

A three-stage pipeline that forces structured reasoning, kills premature convergence, and produces synthesis no single source contains.

The Default is Broken

Give an LLM five research sources and ask for a synthesis. You'll get a five-section summary. That's a book report, not synthesis. I tested this empirically: same model, same context, freeform vs. schema-enforced. The freeform version didn't even read the code it was analyzing. The schema-enforced version produced a novel hybrid recommendation from elements no single source proposed.

Serial Summarization

Processes sources one-by-one, outputs one-by-one. Never asks where they agree, disagree, or what the disagreement reveals.

Premature Convergence

Latches onto the first plausible answer. Spends the rest of the response confirming it. Contradicting evidence gets rationalized away.

Echo Chamber

You say "I think X." It says "Great idea! Here's how X works..." Your framing, mirrored back with better vocabulary. Nothing challenged.

- 8/50: garbage fills caught by the adversarial validator
- 30/50: real flaws found in "good" attempts
- ~60%: initial convergence maps showing false agreement
- ~40%: hypothesis sets rejected for being too similar

The Pipeline

Three skills, each countering a specific failure mode. They activate automatically and hand off to each other.

Research Pipeline Flow

Stage 1: Think Through. Socratic exploration. Surfaces assumptions, maps the landscape, asks "are we researching the right question?" before committing.

Stage 2: Deep Research. External: async multi-source research engine. Internal: semantic search across 26,000+ documents of accumulated project knowledge.

Stage 3: Research Synthesis. 7-phase structured analysis. Competing hypotheses, convergence mapping, stress testing, adversarial validation. Novel output or rejection.
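In code terms, the handoff could be modeled as a simple state pipeline. This is an illustrative sketch, not the actual implementation: the real skills are prompt-level, and every name here (ResearchState, run_pipeline, the stage stubs) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResearchState:
    """Artifacts accumulated as a question moves through the three stages."""
    question: str
    assumptions: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)
    synthesis: str | None = None


def think_through(state: ResearchState) -> ResearchState:
    # Stage 1: Socratic exploration. Surface assumptions before any research.
    state.assumptions.append(f"Unstated framing behind: {state.question}")
    return state


def deep_research(state: ResearchState) -> ResearchState:
    # Stage 2: external engine plus internal semantic search (stubbed here).
    state.sources += ["external research report", "prior internal work"]
    return state


def research_synthesis(state: ResearchState) -> ResearchState:
    # Stage 3: 7-phase structured analysis (stubbed here).
    state.synthesis = f"Synthesis over {len(state.sources)} sources"
    return state


def run_pipeline(question: str) -> ResearchState:
    state = ResearchState(question)
    for stage in (think_through, deep_research, research_synthesis):
        state = stage(state)  # each stage hands its output to the next
    return state
```

What the sketch captures is the one-way handoff: synthesis never runs on sources that haven't passed through the earlier stages.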

Stage 1: Think Through

A Socratic thinking partner, not an advisor. Its loyalty is to clarity, not momentum. "Build it," "shelve it," and "research more" are equally valid outcomes. It adapts lenses to the domain: build ideas get feasibility questions, thesis angles get evidence-strength questions, operational processes get ROI questions.

The critical design choice: it doesn't have an opinion. It surfaces assumptions and pushes back. When I state something as fact, it asks: "What if that's not true?" This is the opposite of default LLM behavior, which optimizes for agreement.

30 minutes of Socratic exploration saves hours of misdirected deep research. Most wasted research effort comes from investigating the wrong question.

Stage 2: Deep Research

Two parallel channels. External: an async research engine cross-references academic papers, production case studies, practitioner reports. Runs for minutes, not seconds. Internal: QMD, a local semantic search engine indexing every conversation log, memory file, project config, trace, and learning across 20+ Claude Code projects. "Have I solved a version of this before?"
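QMD is a real tool in this stack; the sketch below only illustrates the "have I solved a version of this before?" lookup with a naive bag-of-words cosine score. Function names and the scoring scheme are assumptions for illustration, not QMD's API.

```python
import math
from collections import Counter


def _vector(text: str) -> Counter:
    # Crude tokenization: lowercase, split on whitespace.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_prior_work(query: str, documents: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank indexed project documents by similarity to the query,
    dropping anything with zero overlap."""
    q = _vector(query)
    scored = [(cosine(q, _vector(body)), name) for name, body in documents.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]
```

A real semantic index would use embeddings rather than word overlap; the shape of the query ("rank all prior work against this question, return the top hits") is the part that matters.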

Stage 3: Research Synthesis

The core. Seven phases with hard gates that prevent the model from faking rigor.

Phase 1: Evidence Extraction
Decompose sources into specific claims (not topic summaries). Cross-reference every claim against every other claim at extraction time.

Phase 2: Convergence Mapping
Matrix: each source vs. each claim. Supports, contradicts, silent, or qualifies, each with evidence.
HARD GATE: zero contradictions = stop and re-read all sources.
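The convergence matrix and its hard gate reduce naturally to data. The stance labels come from the text above; the type names and gate function are an illustrative sketch, not the skill's actual representation.

```python
from enum import Enum


class Stance(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    SILENT = "silent"
    QUALIFIES = "qualifies"


# One cell per (source, claim) pair in the matrix.
ConvergenceMap = dict[tuple[str, str], Stance]


def passes_contradiction_gate(cmap: ConvergenceMap) -> bool:
    """Hard gate: a map with zero contradictions means the sources were
    read too charitably. The pipeline must stop and re-read, not proceed."""
    return any(stance is Stance.CONTRADICTS for stance in cmap.values())
```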
Phase 3: Hypothesis Formation
Exactly 3 competing approaches. They must be genuinely different: collapsing any two must lose meaning.
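The "collapsing any two must lose meaning" test is a judgment call made by the model in practice. A crude mechanical proxy, using word-level Jaccard similarity with a hypothetical threshold, might look like:

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two hypothesis statements."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def too_similar(hypotheses: list[str], threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return hypothesis pairs that look like the same idea restated.
    Any hit means the set fails the collapse test and must be reformed."""
    return [(h1, h2) for h1, h2 in combinations(hypotheses, 2)
            if jaccard(h1, h2) >= threshold]
```

A surface-similarity check like this catches only restated wording; the real test is semantic, which is why the skill asks the model to attempt the merge rather than compute a score.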

Phase 4: Stress Testing
Assume each hypothesis shipped and failed. Write specific failure modes, not generic risk.
QUALITY GATE: could this failure mode apply to a different hypothesis? If yes, rewrite.
Phase 5: Synthesis
Integrated recommendation referencing convergence map evidence. Must be novel: a conclusion no single source contained.

Phase 6: Output
Structured document: problem, sources, convergence map, hypotheses, pre-mortems, recommendation, open questions.

Phase 7: Adversarial Validation
A separate validator agent scores each section 1-10. Default stance: REJECT. Catches garbage fills and real analytical flaws.
VALIDATOR: all sections must score ≥6 or the synthesis loops back for revision.
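The validator's loop-back rule reduces to a threshold check over per-section scores. The 1-10 scale and the ≥6 bar are from the text; the function shape and section names are illustrative.

```python
MIN_SCORE = 6  # every section must clear this or the synthesis is revised


def validate(section_scores: dict[str, int]) -> tuple[bool, list[str]]:
    """Return (accepted, sections that must be revised).
    Default stance is REJECT: an empty score sheet does not pass."""
    if not section_scores:
        return False, []
    failing = [name for name, score in section_scores.items() if score < MIN_SCORE]
    return not failing, failing
```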

Three Real Examples

Example 1

Why Does My Multi-Agent System Build the Wrong Thing?

Problem: I run 8 autonomous AI agents in parallel, each doing plan→build→verify sprints. They pass all their own quality checks (8+/10), but the actual output scores 2.5/10 by human judgment. Running faster just produces more wrong output faster.
Sources: 6 research reports: memory architectures (Mem0, Zep), intent alignment (spec gates, Manager-Builder-Critic), spec-driven dev (Copilot Workspace, OpenSpec), self-improving systems (Karpathy's AutoResearch), synthesis reasoning (ToT, PRISMA), plus traced production failure data.
Hypotheses:
H1: Spec-Contract System — formal specs as contracts between planner and builder agents
H2: Karpathy Ratchet — scalar metric + auto-revert + full agent autonomy
H3: Adversarial Self-Play — Builder agent builds, Tester agent breaks it, Critic agent evaluates intent
[Convergence map excerpt: six source clusters (Memory, Intent, Specs, Self-Impr, User, Session) scored against claims including "Living spec as truth," "Builder = pure executor," "Human-in-loop for quality," and "Self-improving flywheel."]
Novel Synthesis

None of the 3 hypotheses survived alone. Spec gates prevent intent drift but can't express experiential quality ("it should feel like an iPhone inbox"). A scalar metric captures "feel" but gives unconstrained agents too much freedom. Adversarial testing catches failures but at 15x token cost. The synthesis was a hybrid no source proposed: spec contracts for intent preservation + scalar user-eval ratchet for experiential quality + confidence-scored adversarial verification, with multi-agent cost kept below 3x by running adversarial checks only on flagged sprints.

Example 2

Merging Two Competing Quality Architectures

Problem: Two independently-designed systems — a knowledge-compounding plugin (learns from every completed task) and a hook-enforced sprint system (agents run in strict phase order) — needed to become one architecture. Both had a "quality layer" that claimed to be the right one.
Sources: Plugin source code + architecture docs, sprint system production traces (50 sessions), 4 external research reports on agent memory systems and multi-agent coordination.
Key contradiction: System A uses 15+ parallel subagents for review (expensive, thorough). System B uses single-agent verification (cheap, shallow). Both claim to be the quality gate.
Novel Synthesis

Neither quality layer was right alone. The synthesis decomposed each system into its components and recombined: System A's compound step (3 parallel analyzers distilling every sprint into queryable structured knowledge), System A's confidence calibration (0.0-1.0 scores replacing binary pass/fail), and System B's hook enforcement (no agent discretion on phase ordering). The key insight: System A's 15-agent review army was overkill, but its structured output schema — tagged metadata with problem_type, root_cause, resolution_type — was the actual thing making knowledge compound. System B just needed the schema, not the army.

Example 3

Building an Eval Pipeline for Prompt-Based Agents

Problem: 7 autonomous AI agents running parallel sprints, building software and producing analysis. No structured mechanism to evaluate whether output is actually improving across runs, or to turn human feedback into better next runs.
Sources: 16 research inputs: Glean Trace Learning, ACE (arXiv), Dynamic Cheatsheet, Trace2Skill, DSPy GEPA, Braintrust, LangSmith, PULSE, C3 credit assignment, METR rubrics, Inspect AI, practitioner reports — 21 extracted claims, 4 contradictions identified.
Hypotheses:
H1: Distilled Playbook Injection — narrow strategy memories injected into agent planning
H2: Eval-Driven Regression Testing — growing test suite calibrated against human scores
H3: DSPy Auto-Optimization — algorithmically rewrite agent prompts from eval data
Novel Synthesis

The entire eval tooling industry (Braintrust, LangSmith, Inspect AI) assumes RLHF/DPO — training reward models and fine-tuning model weights. But my agents are prompt-based. You can't adjust weights. The synthesis reframed the problem as a "linguistic optimization loop": the "reward model" is a distillation engine producing natural-language strategies, "gradient descent" is memory injection that changes behavior through context, the "learning rate" is the memory budget, and "regularization" is memory pruning. No single source contained this framing — it emerged from forcing all 16 inputs through the convergence map and finding that the standard paradigm was subtly wrong for prompt-based systems.
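The linguistic optimization loop maps optimizer concepts onto prompt mechanics. A toy sketch of one iteration follows, with the memory budget standing in for the learning rate; every function name here is hypothetical, and the distillation step is stubbed where the real system would run an LLM.

```python
def distill(feedback: list[str]) -> list[str]:
    # "Reward model" analog: turn raw human feedback into reusable
    # natural-language strategies. Stubbed as reformatting; the real
    # step would be an LLM distillation pass.
    return [f"Strategy: {item}" for item in feedback]


def inject(prompt: str, memories: list[str], budget: int) -> str:
    # "Gradient step" analog: change behavior through context, not weights.
    # "Learning rate" analog: the budget caps how many memories ship.
    # "Regularization" analog: pruning, here keeping only the newest entries.
    kept = memories[-budget:]
    return prompt + "\n\nPrior strategies:\n" + "\n".join(kept)


def one_iteration(prompt: str, feedback: list[str], budget: int = 3) -> str:
    """One pass of the loop: distill feedback, inject it under budget."""
    return inject(prompt, distill(feedback), budget)
```

The analogy holds because each knob has a weight-space counterpart: more memories per run means bigger behavioral steps, and pruning keeps the context from overfitting to stale feedback.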

What Gets Killed

| Default Behavior | What Catches It | How |
| --- | --- | --- |
| Five sources → five-section summary | Phase 1: cross-referencing at extraction | Every claim must tag what it agrees with and contradicts. Can't process sources independently. |
| "All sources agree" | Phase 2: zero-contradictions hard gate | If the convergence map has no contradictions, stop and re-read all sources for disagreements. ~60% of initial maps show false consensus. |
| Three options where two are strawmen | Phase 3: similarity collapse test | Can you merge any two hypotheses without losing meaning? If yes, they're the same idea. ~40% rejection rate. |
| "This could fail due to edge cases" | Phase 4: specificity test | Paste the failure mode under a different hypothesis. Still makes sense? It's generic filler; rewrite with a mechanism specific to this approach. |
| "Based on the research, I recommend..." (restates strongest source) | Phase 5 + Phase 7: novelty check + adversarial validator | Could this recommendation have been written without the analysis? If yes, the analysis was theater. The validator's default stance is REJECT. |
| Researching the wrong question entirely | Stage 1: Think Through | Socratic exploration maps the full landscape before any research commitment and surfaces unexamined assumptions in the question itself. |

The Meta-Insight

Same model. Same context window. Radically different output. The variable isn't intelligence — it's cognitive structure.

Freeform Claude with "synthesize these findings" gives you a competent book report. Claude inside Research Synthesis — with mandatory cross-referencing, hard-gated contradiction detection, similarity-tested hypotheses, specificity-tested pre-mortems, and adversarial validation — produces genuine analytical work that surprises even me.

The frontier isn't better models. It's better cognitive architecture around the models we have.

And it compounds. Every synthesis adds to the 26,000+ document knowledge base. Future sessions find prior work and build on it. The Think Through skill references more prior explorations. The convergence maps get richer because more prior syntheses exist to cross-reference.

Research that makes itself better at research. That's the actual promise of building with LLMs.