Define your experiment in a Markdown file. AutoResearch runs 100 experiments overnight -- each one tested, scored, and git-committed. You review the results over coffee.
No credit card required. 3 free runs per month.
The problem
Every developer iterating on prompts, configs, or models runs the same grind: form a hypothesis, edit, run, check, decide, repeat. A productive engineer manages 8-10 of these cycles per day, and most of that time is spent waiting, not thinking.
8-10 cycles/day (human) vs. 100+ cycles/night (agent)
How it works
Define your experiment loop in a simple Markdown file. Describe the goal, setup, constraints, and how each experiment should iterate.
Bring your own key from Anthropic, OpenAI, or Google. Your key, your data. We never store or proxy your API credentials.
AutoResearch runs 100+ experiments overnight. Every improvement is git-committed. Review a clean changelog in the morning.
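As a concrete sketch, here is what a minimal eval harness might look like. Everything in it is illustrative — the test cases, the placeholder classifier, and the scoring logic are assumptions, not the product's actual harness. The one real contract is the last line: the script prints a named metric to stdout, where the agent reads it.

```python
# Hypothetical eval.py: scores a prompt against labeled test cases
# and prints the metric that the agent parses from stdout.

def classify(prompt: str, text: str) -> str:
    # Placeholder classifier for the sketch; in practice this
    # would call your LLM with the prompt under test.
    return "positive" if "good" in text.lower() else "negative"

def main() -> float:
    cases = [
        {"text": "This release is good.", "label": "positive"},
        {"text": "This release is broken.", "label": "negative"},
    ]
    prompt = "Classify the sentiment as a single category label."
    correct = sum(classify(prompt, c["text"]) == c["label"] for c in cases)
    accuracy = 100.0 * correct / len(cases)
    # The agent reads this line: "<metric_name>: <value>"
    print(f"accuracy_pct: {accuracy:.1f}")
    return accuracy

if __name__ == "__main__":
    main()
```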
# Prompt Optimization
## Goal
Optimize the system prompt to maximize classification accuracy.
## Constraints
- DO NOT MODIFY: eval.py, test_cases.json
- Keep the prompt under 2000 tokens
- Output must be a single category label
## Experiment Loop
1. Read the current prompt and recent scores
2. Run: python eval.py
3. Read metric: accuracy_pct from stdout
4. If improved: keep. If not: revert.
## Metric
metric_name: accuracy_pct
higher_is_better: true

Features
Set it and forget it. AutoResearch runs iterative experiments while you sleep, each building on the last successful result.
Your API key lives only in ephemeral container memory during the run. Never written to disk, never logged, never stored. Full data sovereignty.
Every improvement is committed to git. The ratchet only moves forward -- bad experiments are discarded, good ones are locked in.
Start fast with built-in eval templates for Python scripts, bash commands, LLM-as-judge, and HTTP endpoint scoring.
Watch experiments run live in your browser. Scores, diffs, and status updates stream to your dashboard in real time.
Works with Anthropic Claude, OpenAI GPT, and Google Gemini. Switch providers per run without changing your program.
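The Python-script and bash-command eval templates boil down to the same pattern: run a command, then parse a named metric out of its stdout. A minimal sketch of that pattern (the function name and signature here are illustrative, not the actual template API):

```python
# Illustrative "run a command, parse a metric" scorer.
import re
import subprocess

def score_command(cmd: str, metric_name: str) -> float:
    """Run cmd and extract `<metric_name>: <float>` from its stdout."""
    out = subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout
    match = re.search(rf"{re.escape(metric_name)}:\s*([-\d.]+)", out)
    if match is None:
        raise ValueError(f"metric {metric_name!r} not found in output")
    return float(match.group(1))

print(score_command("echo 'accuracy_pct: 84.5'", "accuracy_pct"))  # → 84.5
```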
Open Source CLI
The experiment loop engine is open source. Run locally with your own API key, or use the cloud for parallel lanes and a dashboard.
Scaffolds a program.md, an eval harness, and a target file to get you started.
Compare
| Feature | AutoResearch Cloud | WandB | MLflow |
|---|---|---|---|
| Who runs experiments | AI agent (autonomous) | Human (manual) | Human (manual) |
| Overnight behavior | Runs 100+ experiments | Waits for you | Waits for you |
| Input required | program.md + API key | Python SDK integration | Python SDK integration |
| Version control | Git-native ratchet | Artifact logging | Artifact logging |
| Setup time | < 5 minutes | 30+ minutes | 1+ hour |
| Data privacy (BYOK) | Self-hosted only | ||
| Eval harness built-in |
Pricing
Start free. Upgrade when your experiments need more power.
- For trying things out
- For individual developers
- For serious experimenters
- For teams shipping together
FAQ
What is a program.md?
A simple Markdown file that describes your experiment loop: what to optimize, how to measure it, and what constraints to follow. Think of it as a recipe for the AI agent. You can write one from scratch or use our templates.
How much does a run cost?
With Claude Haiku or GPT-4o-mini, a typical 100-experiment run costs $1-5 in API credits charged to your own key. With larger models like GPT-4o, expect $15-40. You control the model choice and experiment count before each run.
Is my API key safe?
Your key exists only in ephemeral container memory while your experiments run. It is never written to disk, never logged, and never stored in our database. When the run ends, the container is destroyed and the key is gone.
What happens when experiments fail?
That's the git ratchet: every improvement is committed, every regression is discarded. The system only moves forward. Even if 90 out of 100 experiments fail, you keep the 10 that worked.
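The keep-or-revert decision can be sketched in a few lines. This is a simplification — real runs commit and revert file state via git, not just scores — but it shows why the ratchet only moves forward:

```python
# Sketch of the git-ratchet idea: each experiment's score is compared
# to the best so far; improvements are kept, regressions are discarded.

def ratchet(scores, higher_is_better=True):
    """Return the scores that would be committed, in order."""
    best = None
    kept = []
    for s in scores:
        improved = best is None or (s > best if higher_is_better else s < best)
        if improved:
            best = s
            kept.append(s)  # git commit: lock in the improvement
        # else: revert the working tree and try the next experiment
    return kept

# Even when most experiments regress, only the improvements survive:
print(ratchet([62.0, 58.5, 67.0, 61.0, 71.5]))  # → [62.0, 67.0, 71.5]
```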
Can I cancel a run?
Yes. Hit Cancel and the run stops within seconds. All experiments completed before cancellation are preserved — you don't lose progress.