Autoresearch — Product Design Review
Problem Statement
AI coding agents (Claude Code, Codex CLI, Cursor, etc.) need autonomous iteration to improve codebases against measurable metrics. Today, agents either:
- Ask after every change — breaking flow, requiring human attention for mechanical decisions
- Use heavyweight orchestration — Python/Node scripts with complex dependency chains, slow startup, runtime dependencies
- Have no memory across turns — repeat failed experiments, lose context on compaction
There is no lightweight, compiled infrastructure that gives agents a tight modify→verify→keep/discard loop with git as memory, automatic rollback on failure, and escalation when stuck.
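The keep/discard step of that loop can be sketched as a small state machine. This is a minimal illustration only: `Decision`, `LoopState`, the `escalate_after` threshold, and the "higher metric is better" rule are all assumptions made for the example, not the binary's actual types or policy.

```rust
/// Outcome of one modify -> verify iteration (hypothetical names).
#[derive(Debug, PartialEq)]
enum Decision {
    /// Verification passed and the metric improved: keep the commit.
    Keep,
    /// Verification failed or the metric did not improve: roll back.
    Discard,
    /// Too many discards in a row: stop and ask for help.
    Escalate,
}

/// State carried between iterations.
struct LoopState {
    best_metric: f64,
    consecutive_failures: u32,
    /// Assumed threshold; the real tool may use a different rule.
    escalate_after: u32,
}

impl LoopState {
    /// Decide what to do with the latest experiment.
    /// Assumes a higher metric is better.
    fn decide(&mut self, verify_passed: bool, metric: f64) -> Decision {
        if verify_passed && metric > self.best_metric {
            self.best_metric = metric;
            self.consecutive_failures = 0;
            return Decision::Keep;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.escalate_after {
            Decision::Escalate
        } else {
            Decision::Discard
        }
    }
}
```

In this sketch a `Discard` would map to reverting the working tree in git, and a `Keep` to retaining the commit as the new baseline.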
Solution
A single compiled Rust binary (about 3MB) that provides:
- Hook handler: sub-5ms responses for Claude Code’s plugin hook system (PreToolUse, PostToolUse, UserPromptSubmit, Stop, etc.)
- CLI operations: `init`, `verify`, `guard`, `decide`, `resume`, `health`, `progress`, `watch`, `lessons`, `handoff`, `exec`, plus `runtime run/start/status/supervise/stop` and `parallel prepare/run/closeout/cleanup`
- Agent packages: Claude plugin commands, Codex `.agents` skill/plugin package, OpenCode command/skill/helper-agent package, and shared markdown protocols for iteration loops, security audits, debugging, shipping, product improvement research, and more
The binary handles the mechanical infrastructure. The agent handles the intelligence. Clean separation.
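One plausible way to keep hook handling cheap is to dispatch on the event name with a plain `match` and do nothing for events the tool does not care about. The sketch below uses the event names listed above; the `HookAction` variants are hypothetical labels invented for illustration, since the actual per-event behavior is not specified here.

```rust
/// Hook events named in this document (the hook system defines more).
#[derive(Debug, Clone, Copy, PartialEq)]
enum HookEvent {
    PreToolUse,
    PostToolUse,
    UserPromptSubmit,
    Stop,
}

impl HookEvent {
    /// Map an event name to a variant; unknown names return None.
    fn parse(name: &str) -> Option<Self> {
        match name {
            "PreToolUse" => Some(Self::PreToolUse),
            "PostToolUse" => Some(Self::PostToolUse),
            "UserPromptSubmit" => Some(Self::UserPromptSubmit),
            "Stop" => Some(Self::Stop),
            _ => None,
        }
    }
}

/// Hypothetical categories of work; not the tool's real handler set.
#[derive(Debug, PartialEq)]
enum HookAction {
    CheckGuard,
    RecordResult,
    InjectContext,
    CheckCompletion,
    NoOp,
}

/// Route an event name to an action without any I/O or allocation.
fn route(name: &str) -> HookAction {
    match HookEvent::parse(name) {
        Some(HookEvent::PreToolUse) => HookAction::CheckGuard,
        Some(HookEvent::PostToolUse) => HookAction::RecordResult,
        Some(HookEvent::UserPromptSubmit) => HookAction::InjectContext,
        Some(HookEvent::Stop) => HookAction::CheckCompletion,
        None => HookAction::NoOp,
    }
}
```

Treating unknown events as `NoOp` means a new event type added by the host would be ignored rather than cause a failure, which is one reasonable default for a handler that fires on every tool use.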
Target Users
| User | Integration |
|---|---|
| Claude Code users | Installer builds the binary and installs the plugin hooks |
| Codex CLI users | `$skill-installer` skill plus local `.agents/plugins/marketplace.json` plugin package |
| OpenCode users | Generated `.opencode/` commands, skill, and helper agent |
| Any LLM agent | CLI called directly, skill markdown parsed by agent |
Architecture
┌─────────────────┐     ┌──────────────┐     ┌───────────────┐
│ Agent (Claude/  │────▶│ autoresearch │────▶│ Git repo      │
│ Codex/other)    │     │ binary       │     │ (experiments) │
└─────────────────┘     └──────────────┘     └───────────────┘
        │                      │
        │ reads                │ writes
        ▼                      ▼
┌─────────────────┐     ┌───────────────────────┐
│ SKILL.md /      │     │ autoresearch-results/ │
│ commands/*.md   │     │ ├── results.tsv       │
│ agent packages  │     │ ├── state.json        │
└─────────────────┘     │ ├── context.json      │
                        │ ├── lessons.md        │
                        │ ├── handoff.json      │
                        │ ├── launch.json       │
                        │ ├── runtime.json      │
                        │ └── runtime.log       │
                        └───────────────────────┘
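To make the `results.tsv` artifact concrete, here is a minimal sketch of writing and reading one experiment row. The column layout (iteration, commit, metric, status) and the `ResultRow` type are assumptions for illustration; the actual schema is not specified in this document.

```rust
/// One experiment record. The columns are an assumed layout,
/// not the tool's actual schema.
#[derive(Debug, PartialEq)]
struct ResultRow {
    iteration: u32,
    commit: String,
    metric: f64,
    status: String,
}

impl ResultRow {
    /// Serialize as a single tab-separated line (no trailing newline).
    fn to_tsv(&self) -> String {
        format!(
            "{}\t{}\t{}\t{}",
            self.iteration, self.commit, self.metric, self.status
        )
    }

    /// Parse one line; returns None on a wrong column count or bad numbers.
    fn from_tsv(line: &str) -> Option<Self> {
        let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
        let mut cols = line.split('\t');
        let row = ResultRow {
            iteration: cols.next()?.parse().ok()?,
            commit: cols.next()?.to_string(),
            metric: cols.next()?.parse().ok()?,
            status: cols.next()?.to_string(),
        };
        // Reject lines that carry extra columns.
        if cols.next().is_some() {
            return None;
        }
        Some(row)
    }
}
```

An append-only, line-oriented format like this is easy for both the binary and the agent to read, and a partially written last line can be skipped on parse instead of invalidating the whole file.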
Key Metrics
| Metric | Target | Rationale |
|---|---|---|
| Hook response latency | <5ms p99 | Hooks fire on every tool use; must be invisible |
| Binary size | <5MB | Single-file distribution, no extraction needed |
| Runtime dependencies | Zero | No Node, Python, Docker. Just the binary. |
| Cold start | <10ms | First invocation must feel instant |
| Memory usage | <5MB RSS | Runs alongside the agent, not competing for resources |
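The latency target is stated as a p99, so checking it means collecting samples and taking a percentile, not an average. A minimal sketch using the nearest-rank method follows; `percentile` and `meets_hook_budget` are hypothetical helper names, and how the samples are collected is left out.

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples.
/// Returns None for an empty slice or a percentile above 100.
fn percentile(samples: &[Duration], pct: u8) -> Option<Duration> {
    if samples.is_empty() || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Integer ceil(pct * n / 100): the 1-based rank of the smallest
    // sample with at least pct% of all samples at or below it.
    let rank = (sorted.len() * pct as usize + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// True when the p99 of the samples is under the 5 ms hook budget.
fn meets_hook_budget(samples: &[Duration]) -> bool {
    matches!(percentile(samples, 99), Some(p99) if p99 < Duration::from_millis(5))
}
```

With few samples the nearest-rank p99 is simply the maximum, so a meaningful check needs at least a few hundred hook invocations.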
Non-Goals
- Not a replacement for the agent itself: the binary doesn’t make decisions about what to change. It handles verification, logging, rollback, and state management.
- Not a CI/CD system: it runs locally alongside the agent. The `exec` mode supports CI but is not a pipeline orchestrator.
- Not a test framework: it calls your existing test/lint/build commands and parses their output.
- Not a package manager: it doesn’t manage dependencies, just detects dangerous ones during security audits.
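Calling an existing command and parsing its output for a metric can be sketched with the standard library alone. The helpers below are illustrative assumptions: `extract_metric` uses a simple "last numeric token" rule, and `run_verify` assumes a POSIX `sh` is available. Neither reflects how the binary actually parses output.

```rust
use std::process::Command;

/// Return the last whitespace-separated token that parses as a number,
/// after stripping surrounding punctuation such as "%" or ",".
fn extract_metric(output: &str) -> Option<f64> {
    output
        .split_whitespace()
        .rev()
        .filter_map(|tok| {
            tok.trim_matches(|c: char| !(c.is_ascii_digit() || c == '.' || c == '-'))
                .parse::<f64>()
                .ok()
        })
        .next()
}

/// Run a user-supplied command line and extract a metric from its stdout.
/// Assumes a POSIX `sh`; a non-zero exit status yields Ok(None).
fn run_verify(command_line: &str) -> std::io::Result<Option<f64>> {
    let output = Command::new("sh").arg("-c").arg(command_line).output()?;
    if !output.status.success() {
        return Ok(None);
    }
    Ok(extract_metric(&String::from_utf8_lossy(&output.stdout)))
}
```

A "last numeric token" rule is crude: it works for output such as `coverage: 87.5%` but would pick the wrong number from multi-metric output, where an explicit pattern per metric is the safer choice.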
Modes
| Mode | Purpose |
|---|---|
| Core loop | Iterate against any numeric metric |
| Debug | Scientific bug hunting with hypothesis testing |
| Fix | Crush errors one-by-one until zero |
| Security | STRIDE + OWASP audit with red-team personas |
| Scenario | Edge case generation across 12 dimensions |
| Predict | Multi-persona expert debate |
| Reason | Adversarial refinement with blind judges |
| Probe | Requirements interrogation until saturation |
| Learn | Auto-generate documentation |
| Ship | 8-phase ship workflow |
| Improve | Research ICP needs and generate product improvement PRDs |
| Evals | Analyze iteration results |
| Plan | Convert goal → validated config |
Success Criteria
- Agent can iterate 25+ times without human intervention
- Failed experiments are automatically reverted (zero pollution)
- Cross-session memory via lessons.md survives compaction
- Hook latency is imperceptible to the agent/user
- Background `autoresearch runtime run` can relaunch Codex turns without corrupting artifacts
- Parallel worker closeout produces one authoritative retained result after verification
- Installation is one command for Claude, Codex, and OpenCode paths