Define your experiment in a Markdown file. AutoResearch runs 100 experiments overnight -- each one tested, scored, and git-committed. You review the results over coffee.
No credit card required. 3 free runs per month.
The problem
Every developer iterating on prompts, configs, or models runs the same grind: form a hypothesis, edit, run, check, decide, repeat. A productive engineer manages 8-10 of these cycles per day, and most of that time is spent waiting, not thinking.
8-10 cycles/day (human) vs. 100+ cycles/night (agent)
How it works
Define your experiment loop in a simple Markdown file. Describe the goal, setup, constraints, and how each experiment should iterate.
Bring your own key from Anthropic, OpenAI, or Google. Your key, your data. We never store or proxy your API credentials.
AutoResearch runs 100+ experiments overnight. Every improvement is git-committed. Review a clean changelog in the morning.
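As a concrete sketch, here is what a minimal eval harness might look like. Everything in it is illustrative — the test cases, the placeholder classifier, and the scoring logic are assumptions, not the product's actual harness. The one real contract is the last line: the script prints a named metric to stdout, where the agent reads it.

```python
# Hypothetical eval.py: scores a prompt against labeled test cases
# and prints the metric that the agent parses from stdout.

def classify(prompt: str, text: str) -> str:
    # Placeholder classifier for the sketch; in practice this
    # would call your LLM with the prompt under test.
    return "positive" if "good" in text.lower() else "negative"

def main() -> float:
    cases = [
        {"text": "This release is good.", "label": "positive"},
        {"text": "This release is broken.", "label": "negative"},
    ]
    prompt = "Classify the sentiment as a single category label."
    correct = sum(classify(prompt, c["text"]) == c["label"] for c in cases)
    accuracy = 100.0 * correct / len(cases)
    # The agent reads this line: "<metric_name>: <value>"
    print(f"accuracy_pct: {accuracy:.1f}")
    return accuracy

if __name__ == "__main__":
    main()
```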
# Prompt Optimization
## Goal
Optimize the system prompt to maximize classification accuracy.
## Constraints
- DO NOT MODIFY: eval.py, test_cases.json
- Keep the prompt under 2000 tokens
- Output must be a single category label
## Experiment Loop
1. Read the current prompt and recent scores
2. Run: python eval.py
3. Read metric: accuracy_pct from stdout
4. If improved: keep. If not: revert.
## Metric
metric_name: accuracy_pct
higher_is_better: true

Features
Set it and forget it. AutoResearch runs iterative experiments while you sleep, each building on the last successful result.
Your API key lives only in ephemeral container memory during the run. Never written to disk, never logged, never stored. Full data sovereignty.
Every improvement is committed to git. The ratchet only moves forward -- bad experiments are discarded, good ones are locked in.
Start fast with built-in eval templates for Python scripts, bash commands, LLM-as-judge, and HTTP endpoint scoring.
Watch experiments run live in your browser. Scores, diffs, and status updates stream to your dashboard in real time.
Works with Anthropic Claude, OpenAI GPT, and Google Gemini. Switch providers per run without changing your program.
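The Python-script and bash-command eval templates boil down to the same pattern: run a command, then parse a named metric out of its stdout. A minimal sketch of that pattern (the function name and signature here are illustrative, not the actual template API):

```python
# Illustrative "run a command, parse a metric" scorer.
import re
import subprocess

def score_command(cmd: str, metric_name: str) -> float:
    """Run cmd and extract `<metric_name>: <float>` from its stdout."""
    out = subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout
    match = re.search(rf"{re.escape(metric_name)}:\s*([-\d.]+)", out)
    if match is None:
        raise ValueError(f"metric {metric_name!r} not found in output")
    return float(match.group(1))

print(score_command("echo 'accuracy_pct: 84.5'", "accuracy_pct"))  # → 84.5
```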
Open Source CLI
The experiment loop engine is open source. Run locally with your own API key, or use the cloud for parallel lanes and a dashboard.
Scaffolds a program.md, an eval harness, and a target file to get you started.
Compare
| Feature | AutoResearch Cloud | WandB | MLflow |
|---|---|---|---|
| Who runs experiments | AI agent (autonomous) | Human (manual) | Human (manual) |
| Overnight behavior | Runs 100+ experiments | Waits for you | Waits for you |
| Input required | program.md + API key | Python SDK integration | Python SDK integration |
| Version control | Git-native ratchet | Artifact logging | Artifact logging |
| Setup time | < 5 minutes | 30+ minutes | 1+ hour |
| Data privacy (BYOK) | Self-hosted only | ||
| Eval harness built-in |
Pricing
Start free. Upgrade when your experiments need more power.
- For trying things out
- For individual developers
- For serious experimenters
- For teams shipping together
FAQ
What is a program.md?
A simple Markdown file that describes your experiment loop: what to optimize, how to measure it, and what constraints to follow. Think of it as a recipe for the AI agent. You can write one from scratch or use our templates.
How much does a run cost?
With Claude Haiku or GPT-4o-mini, a typical 100-experiment run costs $1-5 in API credits charged to your own key. With larger models like GPT-4o, expect $15-40. You control the model choice and experiment count before each run.
Is my API key safe?
Your key exists only in ephemeral container memory while your experiments run. It is never written to disk, never logged, and never stored in our database. When the run ends, the container is destroyed and the key is gone.
What happens when experiments fail?
That's the git ratchet: every improvement is committed, every regression is discarded. The system only moves forward. Even if 90 out of 100 experiments fail, you keep the 10 that worked.
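The keep-or-revert decision can be sketched in a few lines. This is a simplification — real runs commit and revert file state via git, not just scores — but it shows why the ratchet only moves forward:

```python
# Sketch of the git-ratchet idea: each experiment's score is compared
# to the best so far; improvements are kept, regressions are discarded.

def ratchet(scores, higher_is_better=True):
    """Return the scores that would be committed, in order."""
    best = None
    kept = []
    for s in scores:
        improved = best is None or (s > best if higher_is_better else s < best)
        if improved:
            best = s
            kept.append(s)  # git commit: lock in the improvement
        # else: revert the working tree and try the next experiment
    return kept

# Even when most experiments regress, only the improvements survive:
print(ratchet([62.0, 58.5, 67.0, 61.0, 71.5]))  # → [62.0, 67.0, 71.5]
```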
Can I cancel a run?
Yes. Hit Cancel and the run stops within seconds. All experiments completed before cancellation are preserved — you don't lose progress.