
Research · Auto Marketing Bench

Retrieval for action selection,
not just answers.

Long-running agents don't just retrieve facts. They retrieve prior decisions and outcomes to choose what to do next. Auto Marketing Bench measures whether retrieved memory actually improves the next action.

Abstract

Most retrieval evaluation asks whether a search returns the right facts. Long-running agents need something different. They retrieve prior decisions and outcomes to choose what to do next. We introduce a benchmark that measures whether retrieved memory actually improves the next action. Five baselines, six tasks, 1,113 records of real B2B content with z-scored KPIs (each KPI normalized to standard deviations from the account's own baseline; +1.5σ means 1.5 standard deviations above typical). Headline: outcome-aware retrieval doubles relevance over dense, but is structurally failure-blind. Only contrastive surfaces both successes and failures.

What gets scored

Two things, scored separately.

Retrieval quality. Did the search find the relevant past records? A record is one social post with its z-scored KPIs attached. The “right” records for a given context are the ones that worked, or failed, in similar past situations. Scored with nDCG@10: relevant records showing up in the top ten of the ranked list.

Decision quality. Given those retrieved records, did the agent pick a better action? An action is which post to ship (from two candidates, or from N), which recipe to use (one of 120 reusable templates that cover 81% of records), or how to revise a recipe. Scored per task: pairwise accuracy, recipe top-1 hit, ranking regret against the oracle.

The kicker: a method can win on retrieval and still pick badly. Outcome-aware doubles dense's nDCG, but its pairwise prediction is near chance (0.495). The same signal gets recipe selection right 67% of the time, where guessing would land 15%. The benchmark exists to measure that gap.
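As a concrete reference for the retrieval score, here is a minimal sketch of nDCG@10 in Python. The graded relevance values and the example ranking are illustrative, not the benchmark's actual grading scheme.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k of a ranked list."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_10(ranked_relevances):
    """nDCG@10: DCG of the returned ranking divided by the ideal DCG.

    `ranked_relevances` is the graded relevance of each retrieved record,
    in the order the retriever returned them (e.g. 2 = clearly relevant
    success/failure from a similar context, 1 = partially relevant, 0 = not).
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True))
    return dcg_at_k(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevant records at ranks 1, 3, and 7 of the top ten.
print(ndcg_at_10([2, 0, 1, 0, 0, 0, 1, 0, 0, 0]))
```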

The agent loop

01 Context → 02 Retrieve → 03 Action → 04 Outcome → 05 Memory → loop

Long-running agents act, observe an outcome, and update memory. The benchmark scores retrieval by what it does to the next action.
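A minimal sketch of that loop, with assumed interfaces: `Record`, `Memory.retrieve`, and the `act` / `observe` callables are placeholders for illustration, not the benchmark's API.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    context: dict   # decision context (account, platform, brief, ...)
    action: str     # what was shipped
    kpi_z: dict     # z-scored KPIs observed after acting
    label: str      # "success" / "neutral" / "failure"

@dataclass
class Memory:
    records: list = field(default_factory=list)

    def retrieve(self, context, k=10):
        # Placeholder: any of the five baselines (BM25, dense, hybrid,
        # outcome-aware, contrastive) would slot in here.
        return self.records[-k:]

def agent_step(memory, context, act, observe):
    """One pass: context -> retrieve -> action -> outcome -> write to memory."""
    evidence = memory.retrieve(context)      # retrieve prior decisions
    action = act(context, evidence)          # choose the next action
    kpi_z, label = observe(action)           # observe the normalized outcome
    memory.records.append(Record(context, action, kpi_z, label))
    return action, kpi_z
```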

1,113
action-outcome records
~4,400
task instances
6
benchmark tasks
5
retrieval baselines

The shift

Standard RAG evaluates answers.
Agents need memory.

In question answering, retrieval supplies facts. In long-running agentic workflows, retrieval supplies the memory substrate for future decisions. Prior actions, outcomes, failures, constraints, and feedback. Auto Marketing Bench scores retrieval by its effect on the next decision, not just topical relevance.

01

Action-outcome records

The retrieval target is not a knowledge passage. It is a prior decision with a normalized, observed KPI outcome.

02

Decision as output

The downstream task is selecting, ranking, or revising an action. Not generating an answer string.

03

Temporal & held-out splits

Memory available before the decision point only. Account- and platform-held-out splits stress-test real deployment.

Six tasks

From evidence retrieval
to online improvement.

Five offline, reproducible tasks plus one optional online protocol for external validity. Each task isolates a different failure mode of retrieval-augmented decision-making.

01

Evidence retrieval

Given a decision context, return prior action-outcome records that should inform the choice. Scored with nDCG, MRR, and success/failure recall.

02

Pairwise prediction

Given two candidate actions and retrieved memory, predict which one will outperform on the target KPI.

03

Candidate ranking

Rank a generated candidate set by predicted KPI. Reported with nDCG@k, top-hit@k, and regret vs. oracle (regret is sketched just after this task list).

04

Recipe selection

Pick a structured content recipe from a library of 120 templates, conditioned on context and retrieved memory.

05

Recipe revision

Edit a recipe to lift predicted outcome. Compared against random, frequency, retrieval-score, and prior-oracle.

06

Online sequential improvement

Optional, human-reviewed online loop where the agent acts, observes feedback, and updates memory across rounds.
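For the ranking task, regret against the oracle has a natural reading: how much realized KPI is given up by shipping the ranker's top pick instead of the best candidate. A minimal sketch, assuming each candidate's realized z-scored target KPI is available at evaluation time; function and variable names are illustrative.

```python
def regret_vs_oracle(ranked_candidates, true_kpi_z):
    """Oracle-best KPI minus the KPI of the candidate the ranker put first.

    `ranked_candidates` is the ranker's ordering of candidate ids;
    `true_kpi_z` maps candidate id -> realized z-scored target KPI.
    Zero regret means the ranker's top pick was the oracle pick.
    """
    chosen = ranked_candidates[0]
    return max(true_kpi_z.values()) - true_kpi_z[chosen]

# Example: the ranker ships "c2", but the oracle pick was "c3".
print(regret_vs_oracle(["c2", "c1", "c3"], {"c1": -0.4, "c2": 0.8, "c3": 1.3}))
# -> 0.5 standard deviations of target KPI left on the table
```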

Pilot dataset

Real records, normalized outcomes,
leakage-safe splits.

Public posts from B2B SaaS, AI dev tools, DTC consumer brands, and fintech accounts on Twitter and LinkedIn. Normalized against per-account / platform / KPI baselines and labeled success / neutral / failure.
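A sketch of the per-account / platform / KPI normalization and labeling. The ±0.5σ threshold and the average-over-KPIs labeling rule are assumptions for illustration, not the benchmark's published procedure.

```python
import statistics

def z_score(value, baseline):
    """Standard deviations from the account's own baseline for this KPI."""
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline)
    return (value - mean) / std if std > 0 else 0.0

def label_record(kpi_z, threshold=0.5):
    """Illustrative labeling rule: average z-score across target KPIs."""
    avg = sum(kpi_z.values()) / len(kpi_z)
    if avg >= threshold:
        return "success"
    if avg <= -threshold:
        return "failure"
    return "neutral"

# Example: one post's raw engagement rate vs. its account/platform baseline.
baseline_engagement = [0.012, 0.015, 0.011, 0.014, 0.013]
z = z_score(0.019, baseline_engagement)   # well above typical -> positive σ
print(round(z, 2), label_record({"engagement": z}))
```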

1,113
records
across 18 accounts
2
platforms
Twitter · LinkedIn
4
target KPIs
engagement · reply · like · amplification rate
120
recipe templates
80.7% record coverage
779 / 166 / 168
train · dev · test
temporal split
3
held-out splits
account · platform-twitter · platform-linkedin

Outcome label distribution

Success / neutral / failure counts over 1,113 total records.

What one record looks like

Twitter · B2B SaaS · <redacted> · 2024-Q3
Success

“We just shipped a 2x speedup on our retrieval API. No model changes, no new infra. Just a smarter chunking strategy. Thread on what we learned ↓”

Engagement
+1.42σ
Reply
+0.87σ
Like
+1.55σ
Amplification
+1.91σ

Each record is a post plus its z-scored KPIs against per-account baselines, labeled success / neutral / failure. The retrieval target.
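As a data structure, the same record could be represented roughly like this; field names and layout are illustrative, not the dataset's exact schema.

```python
record = {
    "platform": "twitter",
    "vertical": "b2b_saas",
    "account": "<redacted>",
    "period": "2024-Q3",
    "text": "We just shipped a 2x speedup on our retrieval API. ...",
    "kpi_z": {                      # z-scored against per-account baselines
        "engagement": 1.42,
        "reply": 0.87,
        "like": 1.55,
        "amplification": 1.91,
    },
    "label": "success",             # success / neutral / failure
}
```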

Baselines

Five retrieval methods,
four KPIs, three findings.

Across-KPI mean of the pilot baselines. Outcome-aware retrieval wins on relevance (nDCG@10: Normalized Discounted Cumulative Gain at rank 10, a standard ranking metric; higher = better-ordered results) but is failure-blind. Contrastive is the only method that surfaces both successes and failures at non-trivial recall.

The five methods, in plain English

  • BM25. Keyword search. Matches the words from the query to the words in past records, weighing rare words more than common ones. The "find-on-page across everything" approach.
  • Dense. Semantic search. Every record and every query gets turned into a list of numbers that captures meaning. Close matches come back even when none of the literal words overlap.
  • Hybrid. Blends BM25 and dense. Keyword signal plus semantic signal, combined into a single ranking (one common fusion rule is sketched just after this list).
  • Outcome-aware. Semantic search filtered to past successes only. Wins on relevance because it skips the noise. The trade: it can never surface a failure, so an agent using it can’t learn what to avoid.
  • Contrastive. A learned retrieval model trained to keep successes and failures in separate regions of the search space. The only method that returns both at non-trivial recall.
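One common way to blend keyword and semantic rankings is reciprocal rank fusion. The benchmark's exact fusion rule isn't stated here, so treat this as an illustrative sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of record ids into one.

    `rankings` is a list of ranked lists (e.g. [bm25_ids, dense_ids]).
    Each record's fused score is the sum of 1 / (k + rank) over the lists
    that returned it; k damps the influence of any single list.
    """
    scores = {}
    for ranking in rankings:
        for rank, record_id in enumerate(ranking, start=1):
            scores[record_id] = scores.get(record_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["r12", "r7", "r3", "r44"]
dense_ids = ["r3", "r12", "r91", "r7"]
print(reciprocal_rank_fusion([bm25_ids, dense_ids]))
```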

Why contrastive

Wins and losses live in different regions.

Most retrievers cluster by topic. Wins next to wins, losses next to losses. Contrastive learns to keep them in separate regions of the search space, so a single query can pull from either pile.

Diagram: successes and failures plotted in the retrieval embedding space, separated by the learned contrastive boundary.
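A hedged sketch of what "trained to keep successes and failures in separate regions" could mean in practice: a triplet-style objective over past records that pulls same-outcome pairs together and pushes opposite-outcome pairs apart. This is an assumed formulation for illustration, not necessarily the benchmark's loss.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_contrastive_loss(anchor, same_outcome, opposite_outcome, margin=0.2):
    """max(0, margin + d(anchor, same) - d(anchor, opposite)).

    Minimized over many (anchor, same-outcome, opposite-outcome) triplets,
    this pushes success embeddings and failure embeddings into separate
    regions, so a single query can pull top-k hits from either pile.
    """
    return max(0.0, margin
               + cosine_distance(anchor, same_outcome)
               - cosine_distance(anchor, opposite_outcome))

# Toy embeddings: two successes near each other, one failure far away.
success_a = np.array([0.9, 0.1, 0.0])
success_b = np.array([0.8, 0.2, 0.1])
failure_c = np.array([0.1, 0.9, 0.2])
print(round(triplet_contrastive_loss(success_a, success_b, failure_c), 3))
# -> 0.0 for this already well-separated triplet
```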

nDCG@10

Higher is better. Outcome-aware leads.

  • BM25: 0.188
  • Dense: 0.271
  • Hybrid: 0.243
  • Outcome-aware: 0.500
  • Contrastive: 0.459

Failure recall@10

Higher means the method can surface failures, not just wins.

  • BM25: 0.030
  • Dense: 0.031
  • Hybrid: 0.039
  • Outcome-aware: 0.000
  • Contrastive: 0.068

Full numbers

Method          nDCG@10   MRR     Success recall@10   Failure recall@10
BM25            0.188     0.416   0.024               0.030
Dense           0.271     0.547   0.034               0.031
Hybrid          0.243     0.543   0.021               0.039
Outcome-aware   0.500     0.684   0.125               0.000
Contrastive     0.459     0.670   0.068               0.068

Across-KPI mean over engagement, reply, like, and amplification rate. Best per column: outcome-aware on nDCG@10, MRR, and success recall; contrastive on failure recall.

F1

Outcome-aware ≈ 2× dense

Outcome-aware retrieval roughly doubles nDCG@10 over dense retrieval (0.500 vs 0.271 mean), but reaches zero failure-recall@10. Structurally failure-blind.

F2

Contrastive surfaces both

Contrastive retrieval is the only method that returns both successes and failures at non-trivial recall while keeping strong overall nDCG.

F3

Retrieval ≠ pointwise decision

Naive retrieval-vote pairwise prediction is near chance (mean 0.495). The same signal drives recipe selection to top-1 hit 0.667 vs. 0.155 random. The gap is the object of study.
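One plausible reading of the naive retrieval-vote rule, sketched under assumptions (the benchmark's exact voting scheme is not spelled out here): retrieve neighbors for each candidate and let their outcome labels vote.

```python
def retrieval_vote_pairwise(candidate_a, candidate_b, retrieve, k=10):
    """Predict which of two candidate actions will outperform on the target
    KPI by majority vote over the outcome labels of each candidate's
    retrieved neighbors. `retrieve(candidate, k)` is an assumed callable
    returning prior records with a "label" field.
    """
    def vote(candidate):
        neighbors = retrieve(candidate, k)
        return sum(+1 if r["label"] == "success" else -1
                   for r in neighbors if r["label"] != "neutral")

    return "a" if vote(candidate_a) >= vote(candidate_b) else "b"
```

Counting labels this way ignores how similar the neighbors actually are and how large their outcomes were, which is consistent with the reported gap between retrieval quality and pointwise decision quality.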

Pipeline

Qdrant-backed memory.
Sparse, dense, hybrid, outcome-aware.

Records flow through scrape, normalize, split, index. The same Qdrant collection serves all five retrieval baselines via sparse, dense, and payload-conditioned queries.

01

Scrape

Public posts across 18 accounts, 2 platforms.

02

Normalize

Per-account/platform/KPI z-scores. Success / neutral / failure labels.

03

Split

Temporal + account + platform held-out.

04

Index

Qdrant: BM25 sparse + dense + hybrid + payload filters.

05

Evaluate

Retrieval, pairwise, ranking, selection, revision.
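A minimal sketch of the index-and-query step against Qdrant, assuming points carry the post text, z-scored KPIs, and outcome label as payload. The collection name, vector size, and the `label == "success"` filter (approximating the outcome-aware baseline) are illustrative; dropping the filter gives the plain dense query.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")   # assumes a local Qdrant

# Dense query restricted to past successes via a payload filter: roughly the
# outcome-aware baseline. Without the filter this is the dense baseline;
# sparse BM25 vectors stored alongside support the hybrid one.
hits = client.search(
    collection_name="amb_records",          # illustrative collection name
    query_vector=[0.0] * 768,               # stand-in for the embedded decision context
    query_filter=models.Filter(
        must=[models.FieldCondition(
            key="label", match=models.MatchValue(value="success"))]
    ),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload.get("label"), hit.payload.get("text", "")[:60])
```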

Manuscript

Retrieval for
Action Selection.

Evaluating Outcome-Conditioned Memory in Long-Running Agents. SIGIR-style draft covering problem formulation, six tasks, dataset, baselines, and the gap between retrieval quality and pointwise decision quality.

  • Status: Pilot release · scaffold
  • Domain: GTM content strategy
  • Records: 1,113 across 18 accounts
  • Splits: Temporal · account · platform
  • Index: Qdrant (sparse + dense + hybrid)
  • Tasks: 6 (5 offline, 1 online)