Now in public beta

Wake up to a better system

Define your experiment in a Markdown file. AutoResearch runs 100 experiments overnight -- each one tested, scored, and git-committed. You review the results over coffee.

No credit card required. 3 free runs per month.

The problem

The Research Loop Tax

Every developer iterating on prompts, configs, or models runs the same grind: form a hypothesis, edit, run, check, decide, repeat. A productive engineer manages 8-10 cycles per day. Most of that time is spent waiting, not thinking.

8-10

cycles/day (human)

100+

cycles/night (agent)

How it works

Three steps. Zero babysitting.

1

Upload program.md

Define your experiment loop in a simple Markdown file. Describe the goal, setup, constraints, and how each experiment should iterate.

2

Connect API Key (BYOK)

Bring your own key from Anthropic, OpenAI, or Google. Your key, your data. We never store or proxy your API credentials.

3

Wake up to results

AutoResearch runs 100+ experiments overnight. Every improvement is git-committed. Review a clean changelog in the morning.

program.md
# Prompt Optimization

## Goal
Optimize the system prompt to maximize classification accuracy.

## Constraints
- DO NOT MODIFY: eval.py, test_cases.json
- Keep the prompt under 2000 tokens
- Output must be a single category label

## Experiment Loop
1. Read the current prompt and recent scores
2. Run: python eval.py
3. Read metric: accuracy_pct from stdout
4. If improved: keep. If not: revert.

## Metric
metric_name: accuracy_pct
higher_is_better: true
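
The sample program above runs `python eval.py` and reads `accuracy_pct` from stdout. As a rough sketch of what such a harness might look like: the `classify` stub, the case layout (`input` / `expected`), and the `name: value` output line are all assumptions made for illustration, not the format AutoResearch requires. A real harness would load test_cases.json and call the model instead of using inline samples.

```python
# Hypothetical eval.py sketch: scores a classifier and prints one metric line.

def classify(text: str) -> str:
    """Placeholder classifier (assumption); a real harness would call the model."""
    return "billing" if "invoice" in text.lower() else "other"


def accuracy_pct(cases: list[dict]) -> float:
    """Percentage of cases whose predicted label equals the expected label."""
    if not cases:
        return 0.0
    correct = sum(1 for c in cases if classify(c["input"]) == c["expected"])
    return 100.0 * correct / len(cases)


# Inline stand-in for test_cases.json (assumed layout).
SAMPLE_CASES = [
    {"input": "Where is my invoice?", "expected": "billing"},
    {"input": "Reset my password", "expected": "account"},
    {"input": "Invoice total looks wrong", "expected": "billing"},
    {"input": "What are your hours?", "expected": "other"},
]

# Assumed output format: a single "metric_name: value" line on stdout.
print(f"accuracy_pct: {accuracy_pct(SAMPLE_CASES):.1f}")
```

Keeping the metric on one predictable line makes it easy for any loop runner to parse, whatever the exact format turns out to be.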

Features

Everything you need. Nothing you don't.

Autonomous Overnight Loop

Set it and forget it. AutoResearch runs iterative experiments while you sleep, each building on the last successful result.

BYOK -- Your Key, Your Data

Your API key lives only in ephemeral container memory during the run. Never written to disk, never logged, never stored. Full data sovereignty.
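
For local runs, the same idea can be approximated in your own scripts: read the key from the environment at the moment it is needed and scrub it from anything that gets logged. This is a minimal sketch of that pattern, not AutoResearch's implementation; `load_api_key` and `redact` are hypothetical helper names.

```python
import os


def load_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Read the key from the process environment only; nothing is written to disk."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    return key


def redact(text: str, secret: str) -> str:
    """Replace any occurrence of the secret before the text reaches a log."""
    if not secret:
        return text
    return text.replace(secret, "[REDACTED]")


# Demo with an obviously fake key (not a real credential).
fake_key = "sk-ant-example-not-a-real-key"
print(redact(f"request failed, key={fake_key}", fake_key))
```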

Git-Backed Ratchet

Every improvement is committed to git. The ratchet only moves forward -- bad experiments are discarded, good ones are locked in.
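
The keep-or-revert rule can be sketched in a few lines. This is an illustrative model of the ratchet logic only: it tracks scores in memory, where a real run would commit or restore files in git. The baseline of 0.65 is inferred from the sample CLI output on this page (0.72 reported as +0.07); `ratchet` and `Outcome` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    experiment: int
    score: float
    kept: bool


def ratchet(baseline: float, scores: list[float],
            higher_is_better: bool = True) -> tuple[float, list[Outcome]]:
    """Keep a result only if it beats the best score so far; otherwise discard it."""
    best = baseline
    outcomes = []
    for i, score in enumerate(scores, start=1):
        improved = score > best if higher_is_better else score < best
        if improved:
            best = score  # a real run would commit the change here
        # on failure, a real run would restore the last committed version
        outcomes.append(Outcome(i, score, improved))
    return best, outcomes


best, outcomes = ratchet(0.65, [0.72, 0.61, 0.78])
for o in outcomes:
    print(o.experiment, "KEPT" if o.kept else "revert", o.score)
print("best:", best)
```

Because each candidate is compared against the best score so far rather than the previous attempt, a failed experiment can never lower the committed result.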

Eval Harness Templates

Start fast with built-in eval templates for Python scripts, bash commands, LLM-as-judge, and HTTP endpoint scoring.

Real-Time SSE Streaming

Watch experiments run live in your browser. Scores, diffs, and status updates stream to your dashboard in real time.
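
Server-Sent Events are plain text: `event:` and `data:` fields, with a blank line ending each event. The sketch below parses that wire format in a simplified way. The `experiment` event name and the JSON payload fields (`n`, `status`, `score`) are assumptions for illustration; the actual dashboard stream may look different.

```python
import json


def parse_sse(stream: str) -> list[dict]:
    """Parse SSE text into a list of {"event", "data"} dicts (simplified)."""
    events = []
    event_type, data_lines = "message", []
    for line in stream.splitlines():
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # drop the single optional space after the colon
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    return events


# Hypothetical stream contents.
sample = (
    "event: experiment\n"
    'data: {"n": 1, "status": "KEPT", "score": 0.72}\n'
    "\n"
    ": keep-alive comment\n"
    "event: experiment\n"
    'data: {"n": 2, "status": "revert", "score": 0.61}\n'
    "\n"
)
for ev in parse_sse(sample):
    payload = json.loads(ev["data"])
    print(ev["event"], payload["n"], payload["status"], payload["score"])
```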

Multi-Provider Support

Works with Anthropic Claude, OpenAI GPT, and Google Gemini. Switch providers per run without changing your program.

Open Source CLI

Install and run in 60 seconds

The experiment loop engine is open source. Run locally with your own API key, or use the cloud for parallel lanes and a dashboard.

terminal
$ pip install autoresearch
$ ars init --template prompt-opt
# Scaffolds program.md + eval harness + target file
$ export ANTHROPIC_API_KEY=sk-ant-...
$ ars run --max-experiments 50
Exp 1/50 KEPT score=0.72 (+0.07) 3s Added structured output format
Exp 2/50 revert score=0.61 (-0.04) 2s Removed examples section
Exp 3/50 KEPT score=0.78 (+0.13) 4s Added chain-of-thought reasoning
...
50 experiments, 18 kept | Best: 0.89 (+0.24, +36.9%)
$ ars diff # see what changed
$ ars apply # write best version to file
View on GitHub or pip install autoresearch-cli

Real Results

58.8 → 82.5 in 15 experiments

We optimized the copy for a landing page. The AI ran 15 experiments in 3 minutes for about $0.02 in API credits.

Before

58.8

Generic, vague copy

After

82.5

Emotional storytelling

Improvement

+40.3%

15 experiments, ~$0.02
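
The improvement figure is a relative change against the starting score. A short check of the arithmetic, using the scores quoted on this page (the 0.65 baseline for the CLI example is inferred from its first log line, 0.72 reported as +0.07):

```python
def relative_improvement_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Landing page copy example: 58.8 -> 82.5
print(f"{relative_improvement_pct(58.8, 82.5):+.1f}%")
# CLI example: inferred baseline 0.65 -> best 0.89
print(f"{relative_improvement_pct(0.65, 0.89):+.1f}%")
```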

Before (score: 58.8)

TaskFlow Pro - Project Management Software

Manage your projects better with TaskFlow Pro. Our software helps teams stay organized and get things done on time.

Features:

  • Task management
  • Team collaboration
  • Time tracking
  • Reporting

Start your free trial today and see the difference.

After (score: 82.5)

Unlock Your Team's True Potential with TaskFlow Pro

Imagine stepping into a world where project chaos transforms into seamless collaboration. Visualize your team not just managing tasks, but thriving together.

What Makes TaskFlow Pro Your Best Choice?

  • Seamless Collaboration — real-time feedback empowers every voice
  • Dynamic Task Management — intuitive workflow that adapts
  • Data-Driven Insights — analytics for continuous improvement
  • Effortless Time Tracking — focus on what matters

Begin your free trial today and feel the renewed energy.

How the AI optimized it (15 experiments, 3 minutes)

KEPT    58.8 → 72.5   Added storytelling narrative
KEPT    72.5 → 77.5   Focused on emotional journey and empowerment
KEPT    77.5 → 81.2   Added vivid scenario showing team impact
revert  8 failed      Stuck at local maximum; auto-switched to explore mode
KEPT    81.2 → 82.5   User-centric storytelling broke through plateau
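
One plausible way to model the switch to explore mode is a failure-streak counter: after a fixed number of consecutive reverts, the next experiment tries something bolder. The sketch below is a guess at the general idea, not the actual trigger AutoResearch uses; `plan_modes` and the `patience` value of 8 (chosen to match the log above) are assumptions.

```python
def plan_modes(results: list[bool], patience: int = 8) -> list[str]:
    """For each experiment, pick a mode based on the failure streak before it.

    `results` holds True for KEPT and False for revert, in run order.
    """
    modes = []
    streak = 0  # consecutive failures seen so far
    for kept in results:
        modes.append("explore" if streak >= patience else "exploit")
        streak = 0 if kept else streak + 1
    return modes


# Shape of the log above: 3 kept, 8 failed, then 1 kept.
history = [True] * 3 + [False] * 8 + [True]
modes = plan_modes(history)
print(modes.index("explore") + 1)  # position of the first explore-mode experiment
```

A kept result resets the streak, so the loop drops back to incremental edits as soon as the plateau is broken.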

Compare

AutoResearch vs. the alternatives

Feature | AutoResearch Cloud | WandB | MLflow
Who runs experiments | AI agent (autonomous) | Human (manual) | Human (manual)
Overnight behavior | Runs 100+ experiments | Waits for you | Waits for you
Input required | program.md + API key | Python SDK integration | Python SDK integration
Version control | Git-native ratchet | Artifact logging | Artifact logging
Setup time | < 5 minutes | 30+ minutes | 1+ hour
Data privacy (BYOK)Self-hosted only
Eval harness built-in

Pricing

Simple, transparent pricing

Start free. Upgrade when your experiments need more power.

Free

For trying things out

$0 /month
  • 3 runs / month
  • 1 parallel lane
  • 25 experiments per run
  • Community support
Get Started Free

Starter

For individual developers

$9 /month
  • 20 runs / month
  • 2 parallel lanes
  • 100 experiments per run
  • Email support
  • Custom eval scripts
Start Building
Most popular

Pro

For serious experimenters

$29 /month
  • Unlimited runs
  • 4 parallel lanes
  • 500 experiments per run
  • Priority support
  • Custom eval scripts
  • SSE real-time streaming
  • Run history & analytics
Start Pro Trial

Team

For teams shipping together

$79 /month
  • Unlimited runs
  • 8 parallel lanes
  • 1000 experiments per run
  • Dedicated support
  • All Pro features
  • Team workspaces
  • SSO & audit logs
  • SLA guarantee
Contact Sales

FAQ

Common questions

What is program.md?

A simple Markdown file that describes your experiment loop: what to optimize, how to measure it, and what constraints to follow. Think of it as a recipe for the AI agent. You can write one from scratch or use our templates.

How much will it cost in API credits?

With Claude Haiku or GPT-4o-mini, a typical 100-experiment run costs $1-5 in API credits charged to your own key. With larger models like GPT-4o, expect $15-40. You control the model choice and experiment count before each run.
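
To get a ballpark for other run sizes, the per-100 figures above can be scaled by experiment count. This sketch assumes cost grows linearly with the number of experiments, which is a simplification: real spend depends on prompt length, output length, and provider pricing. `estimate_cost_range` is a hypothetical helper.

```python
def estimate_cost_range(experiments: int, low_per_100: float,
                        high_per_100: float) -> tuple[float, float]:
    """Scale a quoted per-100-experiment cost range to a given run size (USD)."""
    scale = experiments / 100
    return (low_per_100 * scale, high_per_100 * scale)


# Small-model range quoted above ($1-5 per 100), at the Free tier cap of 25 experiments.
print(estimate_cost_range(25, 1, 5))
# Larger-model range quoted above ($15-40 per 100), at the Pro tier cap of 500 experiments.
print(estimate_cost_range(500, 15, 40))
```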

Is my API key safe?

Your key exists only in ephemeral container memory while your experiments run. It is never written to disk, never logged, and never stored in our database. When the run ends, the container is destroyed and the key is gone.

What if an experiment makes things worse?

That's the git ratchet: every improvement is committed, every regression is discarded. The system only moves forward. Even if 90 out of 100 experiments fail, you keep the 10 that worked.

Can I stop a run mid-way?

Yes. Hit Cancel and the run stops within seconds. All experiments completed before cancellation are preserved — you don't lose progress.
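
Preserving completed work on cancel is commonly done with cooperative cancellation: the loop checks a flag between experiments and stops cleanly, keeping everything already recorded. The sketch below illustrates that pattern with the standard library; it is not AutoResearch's code, and `run_experiments` and its `on_done` callback are hypothetical.

```python
import threading


def run_experiments(total: int, cancel: threading.Event, on_done=None) -> list[int]:
    """Run up to `total` experiments, stopping early once `cancel` is set."""
    completed = []
    for n in range(1, total + 1):
        if cancel.is_set():
            break  # stop between experiments; finished ones are kept
        completed.append(n)  # stand-in for running and recording experiment n
        if on_done:
            on_done(n)
    return completed


cancel = threading.Event()
# Simulate a user pressing Cancel right after experiment 3 finishes.
done = run_experiments(50, cancel, on_done=lambda n: cancel.set() if n == 3 else None)
print(done)  # [1, 2, 3]
```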