
Research · Auto Marketing Bench

Retrieval for action selection,
not just answers.

Long-running agents don't just retrieve facts. They retrieve prior decisions and outcomes to choose what to do next. Auto Marketing Bench measures whether retrieved memory actually improves the next action.

Abstract

Most retrieval evaluation asks whether a search returns the right facts. Long-running agents need something different. They retrieve prior decisions and outcomes to choose what to do next. We introduce a benchmark that measures whether retrieved memory actually improves the next action. Five baselines, six tasks, 1,113 records of real B2B content with z-scored KPIs (each KPI normalized to standard deviations from the account's own baseline; +1.5σ means 1.5 standard deviations above typical). Headline: outcome-aware retrieval doubles relevance over dense, but is structurally failure-blind. Only contrastive surfaces both successes and failures.

What gets scored

Two things, scored separately.

Retrieval quality. Did the search find the relevant past records? A record is one social post with its z-scored KPIs attached. The “right” records for a given context are the ones that worked, or failed, in similar past situations. Scored with nDCG@10: relevant records showing up in the top ten of the ranked list.

Decision quality. Given those retrieved records, did the agent pick a better action? An action is which post to ship (from two candidates, or from N), which recipe to use (one of 120 reusable templates that cover 81% of records), or how to revise a recipe. Scored per task: pairwise accuracy, recipe top-1 hit, ranking regret against the oracle.

The kicker: a method can win on retrieval and still pick badly. Outcome-aware doubles dense's nDCG, but its pairwise prediction is near chance (0.495). The same signal gets recipe selection right 67% of the time, where guessing would land 15%. The benchmark exists to measure that gap.
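As a concrete reference for the retrieval score, here is a minimal sketch of nDCG@10 in Python. The graded relevance values and the example ranking are illustrative, not the benchmark's actual grading scheme.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k of a ranked list."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_10(ranked_relevances):
    """nDCG@10: DCG of the returned ranking divided by the ideal DCG.

    `ranked_relevances` is the graded relevance of each retrieved record,
    in the order the retriever returned them (e.g. 2 = clearly relevant
    success/failure from a similar context, 1 = partially relevant, 0 = not).
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True))
    return dcg_at_k(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevant records at ranks 1, 3, and 7 of the top ten.
print(ndcg_at_10([2, 0, 1, 0, 0, 0, 1, 0, 0, 0]))
```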

The agent loop

01 Context → 02 Retrieve → 03 Action → 04 Outcome → 05 Memory → loop

Long-running agents act, observe an outcome, and update memory. The benchmark scores retrieval by what it does to the next action.
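A minimal sketch of that loop, with assumed interfaces: `Record`, `Memory.retrieve`, and the `act` / `observe` callables are placeholders for illustration, not the benchmark's API.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    context: dict   # decision context (account, platform, brief, ...)
    action: str     # what was shipped
    kpi_z: dict     # z-scored KPIs observed after acting
    label: str      # "success" / "neutral" / "failure"

@dataclass
class Memory:
    records: list = field(default_factory=list)

    def retrieve(self, context, k=10):
        # Placeholder: any of the five baselines (BM25, dense, hybrid,
        # outcome-aware, contrastive) would slot in here.
        return self.records[-k:]

def agent_step(memory, context, act, observe):
    """One pass: context -> retrieve -> action -> outcome -> write to memory."""
    evidence = memory.retrieve(context)      # retrieve prior decisions
    action = act(context, evidence)          # choose the next action
    kpi_z, label = observe(action)           # observe the normalized outcome
    memory.records.append(Record(context, action, kpi_z, label))
    return action, kpi_z
```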

1,113
action-outcome records
~4,400
task instances
6
benchmark tasks
5
retrieval baselines

The shift

Standard RAG evaluates answers.
Agents need memory.

In question answering, retrieval supplies facts. In long-running agentic workflows, retrieval supplies the memory substrate for future decisions. Prior actions, outcomes, failures, constraints, and feedback. Auto Marketing Bench scores retrieval by its effect on the next decision, not just topical relevance.

01

Action-outcome records

The retrieval target is not a knowledge passage. It is a prior decision with a normalized, observed KPI outcome.

02

Decision as output

The downstream task is selecting, ranking, or revising an action. Not generating an answer string.

03

Temporal & held-out splits

Memory available before the decision point only. Account- and platform-held-out splits stress-test real deployment.

Six tasks

From evidence retrieval
to online improvement.

Five offline, reproducible tasks plus one optional online protocol for external validity. Each task isolates a different failure mode of retrieval-augmented decision-making.

01

Evidence retrieval

Given a decision context, return prior action-outcome records that should inform the choice. Scored with nDCG, MRR, and success/failure recall.

02

Pairwise prediction

Given two candidate actions and retrieved memory, predict which one will outperform on the target KPI.

03

Candidate ranking

Rank a generated candidate set by predicted KPI. Reported with nDCG@k, top-hit@k, and regret vs. oracle (regret is sketched just after this task list).

04

Recipe selection

Pick a structured content recipe from a library of 120 templates, conditioned on context and retrieved memory.

05

Recipe revision

Edit a recipe to lift predicted outcome. Compared against random, frequency, retrieval-score, and prior-oracle.

06

Online sequential improvement

Optional, human-reviewed online loop where the agent acts, observes feedback, and updates memory across rounds.
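For the ranking task, regret against the oracle has a natural reading: how much realized KPI is given up by shipping the ranker's top pick instead of the best candidate. A minimal sketch, assuming each candidate's realized z-scored target KPI is available at evaluation time; function and variable names are illustrative.

```python
def regret_vs_oracle(ranked_candidates, true_kpi_z):
    """Oracle-best KPI minus the KPI of the candidate the ranker put first.

    `ranked_candidates` is the ranker's ordering of candidate ids;
    `true_kpi_z` maps candidate id -> realized z-scored target KPI.
    Zero regret means the ranker's top pick was the oracle pick.
    """
    chosen = ranked_candidates[0]
    return max(true_kpi_z.values()) - true_kpi_z[chosen]

# Example: the ranker ships "c2", but the oracle pick was "c3".
print(regret_vs_oracle(["c2", "c1", "c3"], {"c1": -0.4, "c2": 0.8, "c3": 1.3}))
# -> 0.5 standard deviations of target KPI left on the table
```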

Pilot dataset

Real records, normalized outcomes,
leakage-safe splits.

Public posts from B2B SaaS, AI dev tools, DTC consumer brands, and fintech accounts on Twitter and LinkedIn. Normalized against per-account / platform / KPI baselines and labeled success / neutral / failure.
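A sketch of the per-account / platform / KPI normalization and labeling. The ±0.5σ threshold and the average-over-KPIs labeling rule are assumptions for illustration, not the benchmark's published procedure.

```python
import statistics

def z_score(value, baseline):
    """Standard deviations from the account's own baseline for this KPI."""
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline)
    return (value - mean) / std if std > 0 else 0.0

def label_record(kpi_z, threshold=0.5):
    """Illustrative labeling rule: average z-score across target KPIs."""
    avg = sum(kpi_z.values()) / len(kpi_z)
    if avg >= threshold:
        return "success"
    if avg <= -threshold:
        return "failure"
    return "neutral"

# Example: one post's raw engagement rate vs. its account/platform baseline.
baseline_engagement = [0.012, 0.015, 0.011, 0.014, 0.013]
z = z_score(0.019, baseline_engagement)   # well above typical -> positive σ
print(round(z, 2), label_record({"engagement": z}))
```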

1,113
records
across 18 accounts
2
platforms
Twitter · LinkedIn
4
target KPIs
engagement · reply · like · amplification rate
120
recipe templates
80.7% record coverage
779 / 166 / 168
train · dev · test
temporal split
3
held-out splits
account · platform-twitter · platform-linkedin

Outcome label distribution

Success / neutral / failure counts over 1,113 total records.

What one record looks like

Twitter · B2B SaaS · <redacted> · 2024-Q3
Success

“We just shipped a 2x speedup on our retrieval API. No model changes, no new infra. Just a smarter chunking strategy. Thread on what we learned ↓”

Engagement
+1.42σ
Reply
+0.87σ
Like
+1.55σ
Amplification
+1.91σ

Each record is a post plus its z-scored KPIs against per-account baselines, labeled success / neutral / failure. The retrieval target.
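As a data structure, the same record could be represented roughly like this; field names and layout are illustrative, not the dataset's exact schema.

```python
record = {
    "platform": "twitter",
    "vertical": "b2b_saas",
    "account": "<redacted>",
    "period": "2024-Q3",
    "text": "We just shipped a 2x speedup on our retrieval API. ...",
    "kpi_z": {                      # z-scored against per-account baselines
        "engagement": 1.42,
        "reply": 0.87,
        "like": 1.55,
        "amplification": 1.91,
    },
    "label": "success",             # success / neutral / failure
}
```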

Baselines

Five retrieval methods,
four KPIs, three findings.

Across-KPI mean of the pilot baselines. Outcome-aware retrieval wins on relevance (nDCG@10: Normalized Discounted Cumulative Gain at rank 10, a standard ranking metric; higher = better-ordered results) but is failure-blind. Contrastive is the only method that surfaces both successes and failures at non-trivial recall.

The five methods, in plain English

  • BM25. Keyword search. Matches the words from the query to the words in past records, weighing rare words more than common ones. The "find-on-page across everything" approach.
  • Dense. Semantic search. Every record and every query gets turned into a list of numbers that captures meaning. Close matches come back even when none of the literal words overlap.
  • Hybrid. Blends BM25 and dense. Keyword signal plus semantic signal, combined into a single ranking (one common fusion rule is sketched just after this list).
  • Outcome-aware. Semantic search filtered to past successes only. Wins on relevance because it skips the noise. The trade: it can never surface a failure, so an agent using it can’t learn what to avoid.
  • Contrastive. A learned retrieval model trained to keep successes and failures in separate regions of the search space. The only method that returns both at non-trivial recall.
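One common way to blend keyword and semantic rankings is reciprocal rank fusion. The benchmark's exact fusion rule isn't stated here, so treat this as an illustrative sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of record ids into one.

    `rankings` is a list of ranked lists (e.g. [bm25_ids, dense_ids]).
    Each record's fused score is the sum of 1 / (k + rank) over the lists
    that returned it; k damps the influence of any single list.
    """
    scores = {}
    for ranking in rankings:
        for rank, record_id in enumerate(ranking, start=1):
            scores[record_id] = scores.get(record_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["r12", "r7", "r3", "r44"]
dense_ids = ["r3", "r12", "r91", "r7"]
print(reciprocal_rank_fusion([bm25_ids, dense_ids]))
```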

Why contrastive

Wins and losses live in different regions.

Most retrievers cluster by topic. Wins next to wins, losses next to losses. Contrastive learns to keep them in separate regions of the search space, so a single query can pull from either pile.

Diagram: successes and failures plotted in the retrieval embedding space, separated by the learned contrastive boundary.
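A hedged sketch of what "trained to keep successes and failures in separate regions" could mean in practice: a triplet-style objective over past records that pulls same-outcome pairs together and pushes opposite-outcome pairs apart. This is an assumed formulation for illustration, not necessarily the benchmark's loss.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_contrastive_loss(anchor, same_outcome, opposite_outcome, margin=0.2):
    """max(0, margin + d(anchor, same) - d(anchor, opposite)).

    Minimized over many (anchor, same-outcome, opposite-outcome) triplets,
    this pushes success embeddings and failure embeddings into separate
    regions, so a single query can pull top-k hits from either pile.
    """
    return max(0.0, margin
               + cosine_distance(anchor, same_outcome)
               - cosine_distance(anchor, opposite_outcome))

# Toy embeddings: two successes near each other, one failure far away.
success_a = np.array([0.9, 0.1, 0.0])
success_b = np.array([0.8, 0.2, 0.1])
failure_c = np.array([0.1, 0.9, 0.2])
print(round(triplet_contrastive_loss(success_a, success_b, failure_c), 3))
# -> 0.0 for this already well-separated triplet
```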

nDCG@10

Higher is better. Outcome-aware leads.

  • BM25: 0.188
  • Dense: 0.271
  • Hybrid: 0.243
  • Outcome-aware: 0.500
  • Contrastive: 0.459

Failure recall@10

Higher means the method can surface failures, not just wins.

  • BM25: 0.030
  • Dense: 0.031
  • Hybrid: 0.039
  • Outcome-aware: 0.000
  • Contrastive: 0.068

Full numbers

Method          nDCG@10   MRR     Success recall@10   Failure recall@10
BM25            0.188     0.416   0.024               0.030
Dense           0.271     0.547   0.034               0.031
Hybrid          0.243     0.543   0.021               0.039
Outcome-aware   0.500     0.684   0.125               0.000
Contrastive     0.459     0.670   0.068               0.068

Across-KPI mean over engagement, reply, like, and amplification rate. Best per column: outcome-aware on nDCG@10, MRR, and success recall; contrastive on failure recall.

F1

Outcome-aware ≈ 2× dense

Outcome-aware retrieval roughly doubles nDCG@10 over dense retrieval (0.500 vs 0.271 mean), but reaches zero failure-recall@10. Structurally failure-blind.

F2

Contrastive surfaces both

Contrastive retrieval is the only method that returns both successes and failures at non-trivial recall while keeping strong overall nDCG.

F3

Retrieval ≠ pointwise decision

Naive retrieval-vote pairwise prediction is near chance (mean 0.495). The same signal drives recipe selection to top-1 hit 0.667 vs. 0.155 random. The gap is the object of study.
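One plausible reading of the naive retrieval-vote rule, sketched under assumptions (the benchmark's exact voting scheme is not spelled out here): retrieve neighbors for each candidate and let their outcome labels vote.

```python
def retrieval_vote_pairwise(candidate_a, candidate_b, retrieve, k=10):
    """Predict which of two candidate actions will outperform on the target
    KPI by majority vote over the outcome labels of each candidate's
    retrieved neighbors. `retrieve(candidate, k)` is an assumed callable
    returning prior records with a "label" field.
    """
    def vote(candidate):
        neighbors = retrieve(candidate, k)
        return sum(+1 if r["label"] == "success" else -1
                   for r in neighbors if r["label"] != "neutral")

    return "a" if vote(candidate_a) >= vote(candidate_b) else "b"
```

Counting labels this way ignores how similar the neighbors actually are and how large their outcomes were, which is consistent with the reported gap between retrieval quality and pointwise decision quality.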

Pipeline

Qdrant-backed memory.
Sparse, dense, hybrid, outcome-aware.

Records flow through scrape, normalize, split, index. The same Qdrant collection serves all five retrieval baselines via sparse, dense, and payload-conditioned queries.

01

Scrape

Public posts across 18 accounts, 2 platforms.

02

Normalize

Per-account/platform/KPI z-scores. Success / neutral / failure labels.

03

Split

Temporal + account + platform held-out.

04

Index

Qdrant: BM25 sparse + dense + hybrid + payload filters.

05

Evaluate

Retrieval, pairwise, ranking, selection, revision.
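A minimal sketch of the index-and-query step against Qdrant, assuming points carry the post text, z-scored KPIs, and outcome label as payload. The collection name, vector size, and the `label == "success"` filter (approximating the outcome-aware baseline) are illustrative; dropping the filter gives the plain dense query.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")   # assumes a local Qdrant

# Dense query restricted to past successes via a payload filter: roughly the
# outcome-aware baseline. Without the filter this is the dense baseline;
# sparse BM25 vectors stored alongside support the hybrid one.
hits = client.search(
    collection_name="amb_records",          # illustrative collection name
    query_vector=[0.0] * 768,               # stand-in for the embedded decision context
    query_filter=models.Filter(
        must=[models.FieldCondition(
            key="label", match=models.MatchValue(value="success"))]
    ),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload.get("label"), hit.payload.get("text", "")[:60])
```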

Manuscript

Retrieval for
Action Selection.

Evaluating Outcome-Conditioned Memory in Long-Running Agents. SIGIR-style draft covering problem formulation, six tasks, dataset, baselines, and the gap between retrieval quality and pointwise decision quality.

  • Status: Pilot release · scaffold
  • Domain: GTM content strategy
  • Records: 1,113 across 18 accounts
  • Splits: Temporal · account · platform
  • Index: Qdrant (sparse + dense + hybrid)
  • Tasks: 6 (5 offline, 1 online)