Research · Auto Marketing Bench
Retrieval for action selection,
not just answers.
Long-running agents don't just retrieve facts. They retrieve prior decisions and outcomes to choose what to do next. Auto Marketing Bench measures whether retrieved memory actually improves the next action.
Abstract
Most retrieval evaluation asks whether a search returns the right facts. Long-running agents need something different: they retrieve prior decisions and outcomes to choose what to do next. We introduce a benchmark that measures whether retrieved memory actually improves the next action. Five baselines, six tasks, 1,113 records of real B2B content with z-scored[1] KPIs. ([1] Each KPI is normalized to standard deviations from the account's own baseline; +1.5σ means 1.5 standard deviations above typical.) Headline: outcome-aware retrieval doubles relevance over dense retrieval but is structurally failure-blind. Only contrastive retrieval surfaces both successes and failures.
What gets scored
Two things, scored separately.
Retrieval quality. Did the search find the relevant past records? A record is one social post with its z-scored KPIs attached. The “right” records for a given context are the ones that worked, or failed, in similar past situations. Scored with nDCG@10: did relevant records show up in the top ten of the ranked list?
Decision quality. Given those retrieved records, did the agent pick a better action? An action is which post to ship (from two candidates, or from N), which recipe to use (one of 120 reusable templates that cover 81% of records), or how to revise a recipe. Scored per task: pairwise accuracy, recipe top-1 hit, ranking regret against the oracle.
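The retrieval-quality score is the standard nDCG@k metric. A minimal sketch, generic rather than the benchmark's exact harness (the graded relevance values are assumed for illustration):

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k over a ranked list of graded relevance scores.

    `relevances` is the relevance of each retrieved record, in the
    order the retriever returned them (higher = more relevant).
    """
    rels = relevances[:k]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# A perfectly ordered list scores 1.0; a reversed one scores less.
print(ndcg_at_k([3, 2, 1, 0]))
print(ndcg_at_k([0, 1, 2, 3]))
```

The log discount is what makes rank position matter: a relevant record at position 1 counts far more than the same record at position 10.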
The kicker: a method can win on retrieval and still pick badly. Outcome-aware doubles dense's nDCG, but its pairwise prediction is near chance (0.495). The same signal gets recipe selection right 67% of the time, where guessing would land 15%. The benchmark exists to measure that gap.
The agent loop
Long-running agents act, observe an outcome, and update memory. The benchmark scores retrieval by what it does to the next action.
The shift
Standard RAG evaluates answers.
Agents need memory.
In question answering, retrieval supplies facts. In long-running agentic workflows, retrieval supplies the memory substrate for future decisions. Prior actions, outcomes, failures, constraints, and feedback. Auto Marketing Bench scores retrieval by its effect on the next decision, not just topical relevance.
01
Action-outcome records
The retrieval target is not a knowledge passage. It is a prior decision with a normalized, observed KPI outcome.
02
Decision as output
The downstream task is selecting, ranking, or revising an action. Not generating an answer string.
03
Temporal & held-out splits
Memory available before the decision point only. Account-and-platform-held-out splits stress real deployment.
Six tasks
From evidence retrieval
to online improvement.
Five offline, reproducible tasks plus one optional online protocol for external validity. Each task isolates a different failure mode of retrieval-augmented decision-making.
01
Evidence retrieval
Given a decision context, return prior action-outcome records that should inform the choice. Scored with nDCG, MRR, and success/failure recall.
02
Pairwise prediction
Given two candidate actions and retrieved memory, predict which one will outperform on the target KPI.
03
Candidate ranking
Rank a generated candidate set by predicted KPI. Reported with nDCG@k, top-hit@k, and regret vs. oracle.
04
Recipe selection
Pick a structured content recipe from a library of 120 templates, conditioned on context and retrieved memory.
05
Recipe revision
Edit a recipe to lift predicted outcome. Compared against random, frequency, retrieval-score, and prior-oracle.
06
Online sequential improvement
Optional, human-reviewed online loop where the agent acts, observes feedback, and updates memory across rounds.
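The pairwise task's naive baseline, the retrieval-vote predictor that lands near chance in the findings, can be sketched as: retrieve memory for each candidate, then vote by outcome labels. Function names and the voting rule here are illustrative assumptions, not the benchmark's published implementation:

```python
def outcome_vote(neighbor_labels):
    """Score a candidate by the labels of its retrieved neighbors:
    +1 per past success, -1 per past failure, 0 for neutral.
    (Assumed voting rule, for illustration.)"""
    score = {"success": 1, "neutral": 0, "failure": -1}
    return sum(score[label] for label in neighbor_labels)

def predict_winner(neighbors_a, neighbors_b):
    """Predict which of two candidate posts will outperform on the
    target KPI, given each candidate's retrieved memory."""
    return "A" if outcome_vote(neighbors_a) >= outcome_vote(neighbors_b) else "B"

print(predict_winner(["success", "success", "failure"],
                     ["failure", "neutral", "failure"]))
```

The benchmark's point is that this signal, weak for pointwise A/B calls, still drives recipe selection well above chance.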
Pilot dataset
Real records, normalized outcomes,
leakage-safe splits.
Public posts from B2B SaaS, AI dev tools, DTC consumer brands, and fintech accounts on Twitter and LinkedIn. Normalized against per-account / platform / KPI baselines and labeled success / neutral / failure.
Outcome label distribution
Total: 1,113 records.
What one record looks like
“We just shipped a 2x speedup on our retrieval API. No model changes, no new infra. Just a smarter chunking strategy. Thread on what we learned ↓”
Each record is a post plus its z-scored KPIs against per-account baselines, labeled success / neutral / failure. The retrieval target.
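In code, a record can be pictured as a post plus its z-scored KPIs and a derived label. The ±1.0σ cutoffs below are illustrative assumptions, not the benchmark's published thresholds:

```python
def label_record(z_kpis, threshold=1.0):
    """Derive success / neutral / failure from the mean z-scored KPI.
    The +/-1.0 sigma cutoff is an assumed illustration."""
    mean_z = sum(z_kpis.values()) / len(z_kpis)
    if mean_z >= threshold:
        return "success"
    if mean_z <= -threshold:
        return "failure"
    return "neutral"

record = {
    "post": "We just shipped a 2x speedup on our retrieval API...",
    "account": "example-ai-dev-tools",   # hypothetical account id
    "platform": "twitter",
    "z_kpis": {"engagement": 1.8, "reply": 1.2, "like": 1.6, "amplification": 1.4},
}
record["label"] = label_record(record["z_kpis"])
print(record["label"])  # success
```

Because KPIs are normalized per account and platform, a "success" for a small account and a "success" for a large one mean the same thing: unusually good for that account.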
Baselines
Five retrieval methods,
four KPIs, three findings.
Across-KPI mean of the pilot baselines. Outcome-aware retrieval wins on relevance[2] but is failure-blind. ([2] nDCG@10: Normalized Discounted Cumulative Gain at rank 10, a standard ranking metric; higher = better-ordered results.) Contrastive is the only method that surfaces both successes and failures at non-trivial recall.
The five methods, in plain English
- BM25. Keyword search. Matches the words from the query to the words in past records, weighing rare words more than common ones. The "find-on-page across everything" approach.
- Dense. Semantic search. Every record and every query gets turned into a list of numbers that captures meaning. Close matches come back even when none of the literal words overlap.
- Hybrid. Blends BM25 and dense. Keyword signal plus semantic signal, combined into a single ranking.
- Outcome-aware. Semantic search filtered to past successes only. Wins on relevance because it skips the noise. The trade: it can never surface a failure, so an agent using it can’t learn what to avoid.
- Contrastive. A learned retrieval model trained to keep successes and failures in separate regions of the search space. The only method that returns both at non-trivial recall.
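The hybrid baseline needs some way to merge the keyword and semantic rankings. Reciprocal rank fusion is one common choice; the sketch below assumes it for illustration (the benchmark may fuse differently):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each input ranking contributes
    1 / (k + rank) to a document's fused score, so documents that
    rank well in either list rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_order = ["r3", "r1", "r7"]   # keyword ranking
dense_order = ["r1", "r7", "r3"]  # semantic ranking
print(rrf([bm25_order, dense_order]))
```

The constant `k` damps the influence of top ranks so one list cannot dominate; 60 is the conventional default.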
Why contrastive
Wins and losses live in different regions.
Most retrievers cluster by topic. Wins next to wins, losses next to losses. Contrastive learns to keep them in separate regions of the search space, so a single query can pull from either pile.
- Successes
- Failures
- Contrastive boundary
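One common way to push wins and losses into separate regions is a margin-based contrastive (triplet) objective over embeddings. This is a generic formulation, assumed for illustration rather than taken from the benchmark's training setup:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negative, margin=0.5):
    """Zero loss once the same-outcome pair is at least `margin`
    more similar than the opposite-outcome pair; positive otherwise."""
    return max(0.0, margin + cosine(anchor, negative) - cosine(anchor, positive))

# A success anchor close to another success and far from a failure
# satisfies the margin, so the loss is zero.
anchor, pos, neg = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(contrastive_loss(anchor, pos, neg))
```

Trained this way, a query embedded near the success region retrieves wins, and the same query shifted toward the failure region retrieves losses: one index, both piles reachable.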
nDCG@10
Higher is better. Outcome-aware leads.
- BM25: 0.188
- Dense: 0.271
- Hybrid: 0.243
- Outcome-aware: 0.500
- Contrastive: 0.459
Failure recall@10
Higher means the method can surface failures, not just wins.
- BM25: 0.030
- Dense: 0.031
- Hybrid: 0.039
- Outcome-aware: 0.000
- Contrastive: 0.068
Full numbers
| Method | nDCG@10 | MRR | Success recall@10 | Failure recall@10 |
|---|---|---|---|---|
| BM25 | 0.188 | 0.416 | 0.024 | 0.030 |
| Dense | 0.271 | 0.547 | 0.034 | 0.031 |
| Hybrid | 0.243 | 0.543 | 0.021 | 0.039 |
| Outcome-aware | 0.500 | 0.684 | 0.125 | 0.000 |
| Contrastive | 0.459 | 0.670 | 0.068 | 0.068 |
Across-KPI mean over engagement, reply, like, and amplification rate. Best per column highlighted.
F1
Outcome-aware ≈ 2× dense
Outcome-aware retrieval roughly doubles nDCG@10 over dense retrieval (0.500 vs 0.271 mean), but reaches zero failure-recall@10. Structurally failure-blind.
F2
Contrastive surfaces both
Contrastive retrieval is the only method that returns both successes and failures at non-trivial recall while keeping strong overall nDCG.
F3
Retrieval ≠ pointwise decision
Naive retrieval-vote pairwise prediction is near chance (mean 0.495). The same signal drives recipe selection to top-1 hit 0.667 vs. 0.155 random. The gap is the object of study.
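The ranking task reports regret against the oracle: how much KPI was left on the table by shipping the model's top pick instead of the actually-best candidate. A minimal sketch, with illustrative z-scored outcomes:

```python
def regret_at_1(predicted_order, true_kpi):
    """Oracle-best KPI minus the actual KPI of the model's
    top-ranked candidate. Zero means the model picked the winner."""
    return max(true_kpi.values()) - true_kpi[predicted_order[0]]

# Actual (post-hoc) z-scored outcomes per candidate, assumed values.
true_kpi = {"c1": 1.5, "c2": -0.25, "c3": 0.5}
print(regret_at_1(["c3", "c1", "c2"], true_kpi))  # 1.0
print(regret_at_1(["c1", "c3", "c2"], true_kpi))  # 0.0
```

Regret is the deployment-relevant number: a retriever can have mediocre nDCG yet low regret if its mistakes are all near-ties.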
Pipeline
Qdrant-backed memory.
Sparse, dense, hybrid, outcome-aware.
Records flow through scrape, normalize, split, index. The same Qdrant collection serves all five retrieval baselines via sparse, dense, and payload-conditioned queries.
01
Scrape
Public posts across 18 accounts, 2 platforms.
02
Normalize
Per-account/platform/KPI z-scores. Success / neutral / failure labels.
03
Split
Temporal + account + platform held-out.
04
Index
Qdrant: BM25 sparse + dense + hybrid + payload filters.
05
Evaluate
Retrieval, pairwise, ranking, selection, revision.
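The split step's leakage rules reduce to simple predicates: memory must come strictly before the decision point, and never from a held-out account or platform. A sketch under assumed field names:

```python
def eligible_memory(records, decision_time,
                    held_out_accounts=(), held_out_platforms=()):
    """Records an agent may retrieve at a decision point:
    strictly earlier in time, and not in a held-out split.
    ISO date strings compare correctly as plain strings."""
    return [
        r for r in records
        if r["posted_at"] < decision_time
        and r["account"] not in held_out_accounts
        and r["platform"] not in held_out_platforms
    ]

records = [
    {"id": 1, "posted_at": "2024-01-05", "account": "acct-a", "platform": "twitter"},
    {"id": 2, "posted_at": "2024-03-09", "account": "acct-b", "platform": "linkedin"},
    {"id": 3, "posted_at": "2024-02-01", "account": "acct-a", "platform": "linkedin"},
]
mem = eligible_memory(records, "2024-02-15", held_out_accounts={"acct-b"})
print([r["id"] for r in mem])  # [1, 3]
```

In the actual index these predicates map naturally onto payload filters over the Qdrant collection, so the same store serves every split without re-indexing.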
Manuscript
Retrieval for
Action Selection.
Evaluating Outcome-Conditioned Memory in Long-Running Agents. SIGIR-style draft covering problem formulation, six tasks, dataset, baselines, and the gap between retrieval quality and pointwise decision quality.
- Status: Pilot release · scaffold
- Domain: GTM content strategy
- Records: 1,113 across 18 accounts
- Splits: Temporal · account · platform
- Index: Qdrant (sparse + dense + hybrid)
- Tasks: 6 (5 offline, 1 online)