Now in public beta

Wake up to a better system

Define your experiment in a Markdown file. AutoResearch runs 100+ experiments overnight -- each one tested, scored, and git-committed. You review the results over coffee.

No credit card required. 3 free runs per month.

The problem

The Research Loop Tax

Every developer iterating on prompts, configs, or models runs the same grind: form a hypothesis, edit, run, check, decide, repeat. A productive engineer manages 8-10 cycles per day, and most of that time is spent waiting, not thinking.

8-10 cycles/day (human)
100+ cycles/night (agent)

How it works

Three steps. Zero babysitting.

1. Upload program.md

Define your experiment loop in a simple Markdown file. Describe the goal, setup, constraints, and how each experiment should iterate.

2. Connect API Key (BYOK)

Bring your own key from Anthropic, OpenAI, or Google. Your key, your data. We never store or proxy your API credentials.

3. Wake up to results

AutoResearch runs 100+ experiments overnight. Every improvement is git-committed. Review a clean changelog in the morning.

program.md
# Prompt Optimization

## Goal
Optimize the system prompt to maximize classification accuracy.

## Constraints
- DO NOT MODIFY: eval.py, test_cases.json
- Keep the prompt under 2000 tokens
- Output must be a single category label

## Experiment Loop
1. Read the current prompt and recent scores
2. Run: python eval.py
3. Read metric: accuracy_pct from stdout
4. If improved: keep. If not: revert.

## Metric
metric_name: accuracy_pct
higher_is_better: true
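Step 4 of the experiment loop is the whole ratchet in miniature. A minimal sketch of that keep-or-revert loop in Python (the `git` and `eval.py` invocations mirror the example program above; the hosted agent's internals may differ):

```python
import subprocess

def should_keep(score: float, best: float, higher_is_better: bool = True) -> bool:
    """Ratchet rule: keep a change only if it strictly improves the metric."""
    return score > best if higher_is_better else score < best

def run_experiment(best: float) -> float:
    """One loop iteration: run the eval, then commit or revert (sketch)."""
    out = subprocess.run(["python", "eval.py"],
                         capture_output=True, text=True).stdout
    # Parse a line like "accuracy_pct: 87.0" from stdout
    score = float(out.split("accuracy_pct:")[1].split()[0])
    if should_keep(score, best):
        subprocess.run(["git", "commit", "-am", f"score={score:.2f}"])
        return score
    subprocess.run(["git", "checkout", "--", "."])  # discard the regression
    return best
```

Because a reverted experiment returns the previous best unchanged, the score can only move in the direction `higher_is_better` allows.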

Features

Everything you need. Nothing you don't.

Autonomous Overnight Loop

Set it and forget it. AutoResearch runs iterative experiments while you sleep, each building on the last successful result.

BYOK -- Your Key, Your Data

Your API key lives only in ephemeral container memory during the run. Never written to disk, never logged, never stored. Full data sovereignty.

Git-Backed Ratchet

Every improvement is committed to git. The ratchet only moves forward -- bad experiments are discarded, good ones are locked in.

Eval Harness Templates

Start fast with built-in eval templates for Python scripts, bash commands, LLM-as-judge, and HTTP endpoint scoring.
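For the Python-script template, an eval harness only has to print the metric named in program.md to stdout. A minimal sketch of what such an `eval.py` could look like (the sample labels and the `predictions` source are made up for illustration):

```python
def accuracy_pct(predictions: dict, cases: list) -> float:
    """Percent of test cases whose predicted label matches the expected one."""
    correct = sum(1 for c in cases if predictions.get(c["id"]) == c["label"])
    return 100.0 * correct / len(cases)

# In a real harness these would come from test_cases.json and the model's output
cases = [{"id": "t1", "label": "spam"}, {"id": "t2", "label": "ham"}]
predictions = {"t1": "spam", "t2": "spam"}

# AutoResearch reads this line from stdout, per the Metric section of program.md
print(f"accuracy_pct: {accuracy_pct(predictions, cases):.1f}")  # prints accuracy_pct: 50.0
```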

Real-Time SSE Streaming

Watch experiments run live in your browser. Scores, diffs, and status updates stream to your dashboard in real time.
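Under the hood, SSE is plain text over HTTP: each event is one or more `data:` lines followed by a blank line. A minimal parser sketch in Python (the event fields shown are assumptions for illustration, not a documented schema):

```python
import json

def iter_sse_events(lines):
    """Group raw SSE lines into parsed JSON events (data: payloads)."""
    buf = []
    for line in lines:
        if line.startswith("data:"):
            buf.append(line[len("data:"):].strip())
        elif line == "" and buf:  # a blank line terminates one event
            yield json.loads("\n".join(buf))
            buf = []

# Example: events as they might stream from a run
stream = [
    'data: {"exp": 1, "status": "KEPT", "score": 0.72}',
    "",
    'data: {"exp": 2, "status": "revert", "score": 0.61}',
    "",
]
for event in iter_sse_events(stream):
    print(event["exp"], event["status"], event["score"])
```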

Multi-Provider Support

Works with Anthropic Claude, OpenAI GPT, and Google Gemini. Switch providers per run without changing your program.

Open Source CLI

Install and run in 60 seconds

The experiment loop engine is open source. Run locally with your own API key, or use the cloud for parallel lanes and a dashboard.

terminal
$ pip install autoresearch-oss
$ ars init --template prompt-opt

Scaffolds program.md + eval harness + target file

$ export ANTHROPIC_API_KEY=sk-ant-...
$ ars run --max-experiments 50
Exp 1/50 KEPT score=0.72 (+0.07) 3s Added structured output format
Exp 2/50 revert score=0.61 (-0.04) 2s Removed examples section
Exp 3/50 KEPT score=0.78 (+0.13) 4s Added chain-of-thought reasoning
...
50 experiments, 18 kept | Best: 0.89 (+0.24, +36.9%)
$ ars diff # see what changed
$ ars apply # write best version to file
View on GitHub or pip install autoresearch-oss

Compare

AutoResearch vs. the alternatives

Feature | AutoResearch Cloud | WandB | MLflow
Who runs experiments | AI agent (autonomous) | Human (manual) | Human (manual)
Overnight behavior | Runs 100+ experiments | Waits for you | Waits for you
Input required | program.md + API key | Python SDK integration | Python SDK integration
Version control | Git-native ratchet | Artifact logging | Artifact logging
Setup time | < 5 minutes | 30+ minutes | 1+ hour
Data privacy (BYOK) | Yes | No | Self-hosted only
Eval harness built-in | Yes | No | No

Pricing

Simple, transparent pricing

Start free. Upgrade when your experiments need more power.

Free

For trying things out

$0 /month
  • 3 runs / month
  • 1 parallel lane
  • 25 experiments per run
  • Community support
Get Started Free

Starter

For individual developers

$9 /month
  • 20 runs / month
  • 2 parallel lanes
  • 100 experiments per run
  • Email support
  • Custom eval scripts
Start Building
Most popular

Pro

For serious experimenters

$29 /month
  • Unlimited runs
  • 4 parallel lanes
  • 500 experiments per run
  • Priority support
  • Custom eval scripts
  • SSE real-time streaming
  • Run history & analytics
Start Pro Trial

Team

For teams shipping together

$79 /month
  • Unlimited runs
  • 8 parallel lanes
  • 1000 experiments per run
  • Dedicated support
  • All Pro features
  • Team workspaces
  • SSO & audit logs
  • SLA guarantee
Contact Sales

FAQ

Common questions

What is program.md?

A simple Markdown file that describes your experiment loop: what to optimize, how to measure it, and what constraints to follow. Think of it as a recipe for the AI agent. You can write one from scratch or use our templates.

How much will it cost in API credits?

With Claude Haiku or GPT-4o-mini, a typical 100-experiment run costs $1-5 in API credits charged to your own key. With larger models like GPT-4o, expect $15-40. You control the model choice and experiment count before each run.
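The bill is straightforward arithmetic: experiments × tokens per experiment × per-token price. A back-of-envelope estimator (the token counts and per-million-token rates below are illustrative assumptions, not quoted rates; check your provider's current pricing):

```python
def run_cost(experiments, in_tokens, out_tokens,
             in_price_per_m, out_price_per_m):
    """Estimated API cost in dollars for one run."""
    per_exp = (in_tokens / 1e6 * in_price_per_m
               + out_tokens / 1e6 * out_price_per_m)
    return experiments * per_exp

# Illustrative: 100 experiments, ~10k input / 2k output tokens each,
# at assumed small-model rates of $0.80 / $4.00 per million tokens
print(f"${run_cost(100, 10_000, 2_000, 0.80, 4.00):.2f}")  # prints $1.60
```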

Is my API key safe?

Your key exists only in ephemeral container memory while your experiments run. It is never written to disk, never logged, and never stored in our database. When the run ends, the container is destroyed and the key is gone.

What if an experiment makes things worse?

That's the git ratchet: every improvement is committed, every regression is discarded. The system only moves forward. Even if 90 out of 100 experiments fail, you keep the 10 that worked.

Can I stop a run mid-way?

Yes. Hit Cancel and the run stops within seconds. All experiments completed before cancellation are preserved — you don't lose progress.