A three-stage pipeline that forces structured reasoning, kills premature convergence, and produces synthesis no single source contains.
Give an LLM five research sources and ask for a synthesis. You'll get a five-section summary. That's a book report, not synthesis. I tested this empirically: same model, same context, freeform vs. schema-enforced. The freeform version didn't even read the code it was analyzing. The schema-enforced version produced a novel hybrid recommendation from elements no single source proposed.
**Summary, not synthesis.** Processes sources one-by-one, outputs one-by-one. Never asks where they agree, disagree, or what the disagreement reveals.
**Premature convergence.** Latches onto the first plausible answer. Spends the rest of the response confirming it. Contradicting evidence gets rationalized away.
**Sycophantic mirroring.** You say "I think X." It says "Great idea! Here's how X works..." Your framing, mirrored back with better vocabulary. Nothing challenged.
Three skills, each countering a specific failure mode. They activate automatically and hand off to each other.
**Stage 1: Think Through.** A Socratic thinking partner, not an advisor. Its loyalty is to clarity, not momentum. "Build it," "shelve it," and "research more" are equally valid outcomes. It adapts lenses to the domain: build ideas get feasibility questions, thesis angles get evidence-strength questions, operational processes get ROI questions.
The critical design: it doesn't have an opinion. It surfaces assumptions and pushes back. When I state something as fact, it asks: "What if that's not true?" This is the opposite of default LLM behavior, which optimizes for agreement.
30 minutes of Socratic exploration saves hours of misdirected deep research. Most wasted research effort comes from investigating the wrong question.
**Stage 2: Research.** Two parallel channels. External: an async research engine cross-references academic papers, production case studies, and practitioner reports. It runs for minutes, not seconds. Internal: QMD, a local semantic search engine indexing every conversation log, memory file, project config, trace, and learning across 20+ Claude Code projects, to answer one question: "Have I solved a version of this before?"
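The internal channel is, at its heart, ranked retrieval over prior artifacts. A minimal sketch of that idea, using a toy bag-of-words similarity as a stand-in for QMD's real semantic embeddings (the index entries and IDs are hypothetical):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector. A real QMD-style index would use dense
    # semantic embeddings; this stand-in just makes the ranking runnable.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs, k=3):
    # Rank every indexed artifact (logs, memory files, traces) by
    # similarity to the question "have I solved this before?"
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)[:k]

index = [  # hypothetical entries from two projects
    {"id": "proj-a/trace-12", "text": "retry loop with exponential backoff for flaky API calls"},
    {"id": "proj-b/memory", "text": "schema-enforced synthesis prevented book-report summaries"},
]
hits = search("schema-enforced synthesis of research sources", index, k=1)
```

The point is not the similarity metric; it is that every past session is queryable, so prior work surfaces before new research begins.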
**Stage 3: Research Synthesis.** The core. Seven phases with hard gates that prevent the model from faking rigor.
**Phase 1.** Decompose sources into specific claims (not topic summaries). Cross-reference every claim against every other claim at extraction time.
**Phase 2.** Matrix: each source vs. each claim. Supports, contradicts, silent, or qualifies — with evidence.
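A sketch of that matrix as a nested mapping, plus the contradiction check the hard gate relies on (the claim text, source names, and evidence strings are hypothetical):

```python
STANCES = {"supports", "contradicts", "silent", "qualifies"}

def add_cell(cmap, source, claim, stance, evidence):
    # One matrix cell: how this source relates to this claim, with evidence.
    assert stance in STANCES
    cmap.setdefault(claim, {})[source] = (stance, evidence)

def contradictions(cmap):
    # An empty result here does not mean "everyone agrees" -- per the
    # hard gate, it means stop and re-read the sources.
    return [claim for claim, cells in cmap.items()
            if any(stance == "contradicts" for stance, _ in cells.values())]

cmap = {}
add_cell(cmap, "paper-A", "spec gates prevent intent drift", "supports", "Sec 4 eval")
add_cell(cmap, "case-B", "spec gates prevent intent drift", "qualifies", "only pre-launch")
add_cell(cmap, "report-C", "spec gates prevent intent drift", "contradicts", "drift persisted in prod")
```

Forcing a stance for every source-claim pair is what makes independent per-source summarization impossible.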
**HARD GATE:** Zero contradictions = stop and re-read all sources.

**Phase 3.** Exactly 3 competing approaches. Must be genuinely different — collapsing any two must lose meaning.
**Phase 4.** Assume each hypothesis shipped and failed. Write specific failure modes — not generic risk.
**QUALITY GATE:** Could this failure mode apply to a different hypothesis? If yes, rewrite.

**Phase 5.** Integrated recommendation referencing convergence-map evidence. Must be novel — a conclusion no single source contained.
**Phase 6.** Structured document: problem, sources, convergence map, hypotheses, pre-mortems, recommendation, open questions.
**Phase 7.** Separate validator agent scores each section 1-10. Default stance: REJECT. Catches garbage fills and real analytical flaws.
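The score-or-revise mechanics can be sketched as a small control loop, where `score_fn` and `revise_fn` stand in for the validator agent and the revision call (both are hypothetical names; in the real pipeline each is an LLM invocation):

```python
SECTIONS = ["problem", "sources", "convergence_map", "hypotheses",
            "pre_mortems", "recommendation", "open_questions"]

def failing_sections(scores, threshold=6):
    # Default stance is REJECT: a missing score counts as a failure,
    # not a pass.
    return [s for s in SECTIONS if scores.get(s, 0) < threshold]

def synthesis_loop(draft, score_fn, revise_fn, max_rounds=3):
    for _ in range(max_rounds):
        failing = failing_sections(score_fn(draft))
        if not failing:
            return draft          # every section scored at threshold or above
        draft = revise_fn(draft, failing)  # loop back for targeted revision
    raise RuntimeError("synthesis still failing validation after max rounds")
```

Bounding the loop matters: an unbounded revise cycle lets the model grind forever instead of surfacing that the analysis itself is broken.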
**VALIDATOR:** All sections must score ≥6 or synthesis loops back for revision.

None of the 3 hypotheses survived alone. Spec gates prevent intent drift but can't express experiential quality ("it should feel like an iPhone inbox"). A scalar metric captures "feel" but gives unconstrained agents too much freedom. Adversarial testing catches failures but at 15x token cost. The synthesis was a hybrid no source proposed: spec contracts for intent preservation + scalar user-eval ratchet for experiential quality + confidence-scored adversarial verification, with multi-agent cost kept below 3x by running adversarial checks only on flagged sprints.
Neither quality layer was right alone. The synthesis decomposed each system into its components and recombined: System A's compound step (3 parallel analyzers distilling every sprint into queryable structured knowledge), System A's confidence calibration (0.0-1.0 scores replacing binary pass/fail), and System B's hook enforcement (no agent discretion on phase ordering). The key insight: System A's 15-agent review army was overkill, but its structured output schema — tagged metadata with problem_type, root_cause, resolution_type — was the actual thing making knowledge compound. System B just needed the schema, not the army.
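The tagged-metadata schema described above might look like the following sketch (field names come from the text; the example values and the dataclass shape are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class SprintLearning:
    # Tagged metadata that makes every sprint's outcome queryable.
    problem_type: str
    root_cause: str
    resolution_type: str
    confidence: float  # 0.0-1.0 calibrated score, replacing binary pass/fail
    summary: str

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")

entry = SprintLearning(
    problem_type="flaky-test",
    root_cause="shared-state",
    resolution_type="test-isolation",
    confidence=0.85,
    summary="Parallel tests shared a temp dir; per-worker dirs fixed it.",
)
```

A structured record like this, not the size of the review army, is what lets knowledge from one sprint be found and reused by the next.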
Mainstream eval tooling (Braintrust, LangSmith, Inspect AI) assumes RLHF/DPO — training reward models and fine-tuning model weights. But my agents are prompt-based. You can't adjust weights. The synthesis reframed the problem as a "linguistic optimization loop": the "reward model" is a distillation engine producing natural-language strategies, "gradient descent" is memory injection that changes behavior through context, the "learning rate" is the memory budget, and "regularization" is memory pruning. No single source contained this framing — it emerged from forcing all 16 inputs through the convergence map and finding that the standard paradigm was subtly wrong for prompt-based systems.
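Under that framing, one "training step" of the loop can be sketched as follows (the distillation call that produces `new_strategies` would be an LLM in practice; the budget value and function names are hypothetical):

```python
def memory_step(memory, new_strategies, budget_chars=4000):
    # "Gradient descent" = memory injection: freshly distilled
    # natural-language strategies are prepended to the agent's memory,
    # changing behavior through context rather than weights.
    memory = new_strategies + memory
    # "Learning rate" = the memory budget; "regularization" = pruning:
    # trim the stalest entries to keep memory inside the budget.
    kept, used = [], 0
    for entry in memory:
        if used + len(entry) > budget_chars:
            break
        kept.append(entry)
        used += len(entry)
    return kept
```

A larger budget lets more strategies influence the next run (a higher "learning rate"); pruning keeps stale strategies from dominating the context (the "regularizer").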
| Default Behavior | What Catches It | How |
|---|---|---|
| Five sources → five-section summary | Phase 1: cross-referencing at extraction | Every claim must tag what it agrees with and contradicts. Can't process sources independently. |
| "All sources agree" | Phase 2: zero-contradictions hard gate | If convergence map has no contradictions, stop. Re-read all sources for disagreements. ~60% of initial maps show false consensus. |
| Three options where two are strawmen | Phase 3: similarity collapse test | Can you merge any two hypotheses without losing meaning? If yes, they're the same idea. ~40% rejection rate. |
| "This could fail due to edge cases" | Phase 4: specificity test | Paste the failure mode under a different hypothesis. Still makes sense? It's generic filler. Rewrite with mechanism specific to this approach. |
| "Based on the research, I recommend..." [restates strongest source] | Phase 5 + Phase 7: novelty check + adversarial validator | Could this recommendation have been written without the analysis? If yes, the analysis was theater. Validator default stance is REJECT. |
| Researching the wrong question entirely | Stage 1: Think Through | Socratic exploration maps the full landscape before any research commitment. Surfaces unexamined assumptions in the question itself. |
Same model. Same context window. Radically different output. The variable isn't intelligence — it's cognitive structure.
Freeform Claude with "synthesize these findings" gives you a competent book report. Claude inside Research Synthesis — with mandatory cross-referencing, hard-gated contradiction detection, similarity-tested hypotheses, specificity-tested pre-mortems, and adversarial validation — produces genuine analytical work that surprises even me.
The frontier isn't better models. It's better cognitive architecture around the models we have.
And it compounds. Every synthesis adds to the 26,000+ document knowledge base. Future sessions find prior work and build on it. The Think Through skill references more prior explorations. The convergence maps get richer because more prior syntheses exist to cross-reference.
Research that makes itself better at research. That's the actual promise of building with LLMs.