AI Engineer World's Fair
Embedding geometry · Loop · Agentic search

Autoresearch &
test-time compute for
information retrieval

Han Xiao
VP of AI, Elastic
@hxiao  ·  in/hxiao87
scan for the slides
hanxiao.io/aie-sf-2026
Test-time compute

Spend more compute at inference, get a better answer.

Trade inference compute for quality: rather than a bigger model, the model does more work per query, along a smooth cost-versus-quality curve.

Best-of-N
sample many candidates, keep the best.
Self-consistency
majority-vote over sampled chains of thought.
Verifier search
spend inference FLOPs, climb a Pareto curve.
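The first two strategies fit in a few lines. This is a minimal sketch, not code from the talk: `best_of_n`, `self_consistency`, and the toy inputs are hypothetical stand-ins for sampled model outputs and a real scorer.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: keep the single candidate the scorer rates highest."""
    return max(candidates, key=score)


def self_consistency(answers: Sequence[str]) -> str:
    """Self-consistency: majority vote over the final answers of sampled chains."""
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins for sampled outputs (hypothetical data).
voted = self_consistency(["42", "41", "42", "42", "40"])
picked = best_of_n(["a", "abc", "ab"], score=len)  # len is a placeholder scorer
```

Both spend N samples at inference and return one answer; the only difference is whether selection uses a scorer or a vote.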
"Having a bot think for just 20 seconds in a hand of poker got the same performance boost as scaling up the model by 100,000× and training it for 100,000 times longer."
Noam Brown, OpenAI, 2024
The reframe

Search is test-time compute.

You wire trained embeddings, rerankers, multi-vector retrievers, and query expanders into a pipeline at inference to squeeze out relevance. You do not need a bigger model. You assemble more search at test time.

Don't scale it
one keyword match. Cheap, and probably not good enough.
Scale it
embed, retrieve, rerank, expand, fuse. More inference can buy more relevance.

The question is not whether the model is big enough. It is how much search pipeline you assemble at test time, and whether it pays off.

Two versions

Two ways to manufacture that pipeline at test time.

A   Recombine the geometry
An agentic loop writes programs over one frozen encoder: chunk, z-score, fuse channels, feed back. The pipeline is multi-pass embedding algebra.
this talk's core
B   Compose the tools
A small agent LLM wires retrieval tools (grep, embed, rerank, select_diverse) over a corpus under a budget. The pipeline is an agentic tool chain.
searchbox  ·  later
Version A

"Small models can't improve." But they are distilled from LLMs.

Test-time compute is said to belong to large reasoning models, with small models unable to improve. Yet today's embedders - including the small deployable ones - are distilled or adapted from LLM backbones:

Mistral-7B
E5-Mistral, SFR-Embedding, GritLM, NV-Embed, bge-en-icl
Qwen3
qwen3-embed, jina-embeddings-v5
Gemma 3
EmbeddingGemma
LLaMA
RepLLaMA, LLM2Vec

The lineage from a large language model to a small deployable encoder is now the dominant recipe, not the exception.

If test-time compute lives in the LLM representation space, these distilled encoders should inherit it. Do they?

The intuition

Scoring runs from one cosine to late interaction.

single-vector cosine
\( \cos(q,\,d) \)
one vector per document
frozen baseline
sentence MaxSim
\( \max_{s}\ \cos(q,\,s) \)
same encoder, document split into sentence vectors
test-time compute
ColBERT multi-vector
\( \sum_i \max_j\ \cos(q_i,\,d_j) \)
every query token, max over all doc tokens
needs a multi-vector model
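The three scoring rules above can be compared side by side in NumPy. This is a sketch under assumptions: the vectors are toy 2-D arrays, and the mean of the sentence vectors is used only as a stand-in for a whole-document embedding, which a real encoder would compute in its own forward pass.

```python
import numpy as np


def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so dot products are cosines."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cosine_score(q: np.ndarray, d: np.ndarray) -> float:
    """Single-vector cosine: one vector per query, one per document."""
    return float(_unit(q) @ _unit(d))


def sentence_maxsim(q: np.ndarray, sents: np.ndarray) -> float:
    """Max cosine between the query and each sentence vector, shape (n_sents, dim)."""
    return float(np.max(_unit(sents) @ _unit(q)))


def colbert_maxsim(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Sum over query tokens of the max cosine over all document tokens."""
    sims = _unit(q_tokens) @ _unit(d_tokens).T  # (n_q_tokens, n_d_tokens)
    return float(sims.max(axis=1).sum())


q = np.array([1.0, 0.0])
sentences = np.array([[0.0, 1.0], [1.0, 0.0]])  # second sentence matches q exactly
doc_vec = sentences.mean(axis=0)                # assumed whole-document vector

single = cosine_score(q, doc_vec)          # diluted by the off-topic sentence
per_sent = sentence_maxsim(q, sentences)   # recovers the matching sentence
late = colbert_maxsim(np.eye(2), sentences)
```

In this toy case the single-vector score is diluted by the off-topic sentence, while sentence MaxSim recovers the exact match with the same encoder.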
The question

How much can a frozen single-vector encoder improve at inference alone?

No retraining · No auxiliary model · No learned parameters · One frozen encoder API

The prominent recent test-time methods for retrieval each break at least one rule:

HyDE / Query2Doc
an external LLM in the query path.
GQR
a second, supervisory retriever.
MetaEmbed
trained extra parameters at deployment.

We forbid all three, and ask whether the improvement scales with the compute spent.

Autoresearch

Autoresearch: let the agent climb the hill.

An agent runs the research loop itself: change one file, run a short fixed-budget experiment, keep the change if the metric improved, otherwise revert. Repeat, overnight. It is hill-climbing, with an LLM as the mutation function.

change a file → run a short experiment → keep if better, else revert
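The keep-or-revert loop is ordinary hill-climbing. The sketch below is illustrative only: `hill_climb` and the toy objective are hypothetical, and a random step stands in for the LLM that edits a file in the real loop.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def hill_climb(
    start: T,
    mutate: Callable[[T], T],
    evaluate: Callable[[T], float],
    steps: int,
) -> Tuple[T, float]:
    """Keep a candidate only if it strictly improves the metric, else revert."""
    best, best_score = start, evaluate(start)
    for _ in range(steps):
        candidate = mutate(best)     # stand-in for the LLM editing one file
        score = evaluate(candidate)  # stand-in for the short fixed-budget run
        if score > best_score:       # keep if better
            best, best_score = candidate, score
        # otherwise revert: `best` stays unchanged
    return best, best_score


# Toy objective (hypothetical): maximize -(x - 3)^2 with random +/-1 steps.
rng = random.Random(0)
best_x, best_val = hill_climb(
    start=0,
    mutate=lambda x: x + rng.choice([-1, 1]),
    evaluate=lambda x: -(x - 3) ** 2,
    steps=200,
)
```

The loop never accepts a regression, so the quality of the result depends entirely on the metric it is handed.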
"You're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md files that set up your autonomous research org."
Andrej Karpathy, Anthropic, 2026
Method

An LLM agent writes programs over the frozen encoder, and an evaluator scores them.

Proposer
LLM agent, Opus 4.6
Program P
Python over j-v5-nano
Evaluator
ΔnDCG, 14 tasks
Memory
JSONL file
Registry
144 programs
the loop: memory conditions the next program
writes: 1 program per generation
the encoder: 239M, frozen, small + fast
scores on: 14 MMTEB tasks, ΔnDCG@10
logs: Δ, cost, parent, lesson per program
after G = 144: 144 programs scored
Method · 1  Proposer

The proposer is an LLM, used as the mutation function.

Design

Opus 4.6 reads the current frontier source and the memory file, edits one Python program, and proposes the next. Keep it if the metric improves, else revert. No human in the inner loop.

Caveat

It optimizes exactly the metric you hand it - not the one you mean. Reward in-domain performance and reward spending compute, and that is what it will chase. Whether the improvement survives out of domain is a separate question. The objective is the steering wheel.

Method · 2  Program

The program is arbitrary Python over the frozen encoder.

Design
rerank(q_emb, d_emb, sim, *,
      embed_fn, q_texts, d_texts) -> scores

embed_fn is the test-time-compute budget. It re-embeds any text, switches the LoRA adapter (retrieval / passage / classification / matching), picks a Matryoshka dimension, or caps length. One call is one unit of compute.
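To make the signature concrete, here is a sketch of one program that fits it. Everything beyond the signature is an assumption: the array shapes (`q_emb` as queries by dim, `sim` as queries by documents), `embed_fn` taking a list of strings and returning a 2-D array, the naive period-based sentence splitter, and `toy_embed`, a word-count stand-in for the real encoder.

```python
import numpy as np

VOCAB = ["tax", "law", "cat", "dog"]  # tiny vocabulary for the toy encoder


def toy_embed(texts):
    """Hypothetical stand-in encoder: smoothed word counts over VOCAB."""
    rows = []
    for t in texts:
        words = t.lower().replace(".", " ").split()
        rows.append([words.count(w) + 1e-3 for w in VOCAB])
    return np.array(rows, dtype=float)


def _unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def rerank(q_emb, d_emb, sim, *, embed_fn, q_texts, d_texts):
    """Score each document by the better of whole-document and best-sentence cosine."""
    scores = np.array(sim, dtype=float)  # copy, so the baseline stays intact
    qn = _unit(q_emb)
    for j, text in enumerate(d_texts):
        sents = [s.strip() for s in text.split(".") if s.strip()]
        if len(sents) < 2:
            continue  # nothing finer than the whole document
        s_emb = embed_fn(sents)  # extra forward passes: the compute being spent
        best = (qn @ _unit(s_emb).T).max(axis=1)  # best sentence per query
        scores[:, j] = np.maximum(scores[:, j], best)
    return scores


q_texts = ["tax law"]
d_texts = ["cat dog. tax law. cat cat", "dog dog dog"]
q_emb, d_emb = toy_embed(q_texts), toy_embed(d_texts)
sim = _unit(q_emb) @ _unit(d_emb).T  # baseline cosine
new_scores = rerank(q_emb, d_emb, sim, embed_fn=toy_embed,
                    q_texts=q_texts, d_texts=d_texts)
```

The first document's score rises because one of its sentences matches the query exactly; the single-sentence document is left at its baseline cosine.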

Caveat

Auto-rejected: hyperparameters, stochastic ops, task-specific routing, external models, learned parameters, trivial constant-only variants. These constraints force task-agnostic structure, not a per-task tuned config.

Method · 3  Evaluator

The evaluator scores every program on the same 14 tasks.

Design

Every program runs on the same 14 MMTEB Tier-1 discovery tasks (legal, financial, long-document, general). The evaluator scores ΔnDCG@10 against the cosine baseline, plus a cost ratio (embed_fn calls per query). The same fixed budget every generation.
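For reference, ΔnDCG@10 for one query can be computed as below. This is a sketch that assumes linear gains and that the ranked list contains every judged document, so the ideal ordering can be taken from the same list; the evaluator used in the talk may differ in both respects.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal ordering of the same gains."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Toy query: the only relevant document sits at rank 2 under the cosine
# baseline and at rank 1 after the program re-scores (hypothetical data).
baseline = ndcg_at_k([0, 1, 0])
program = ndcg_at_k([1, 0, 0])
delta_ndcg = program - baseline
```

A positive delta means the program moved relevant documents up relative to plain cosine on that query.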

Caveat

The loop only ever sees these 14 tasks. The 19 held-out tasks never enter the loop, so a program can win in-domain and still fail out of domain. That gap is the entire experiment.

Method · 4  Memory

Every program is logged, and the log conditions the next one.

Design

A JSONL file, one row per program: per-task ΔnDCG@10, win/tie/loss, cost ratio, parent, a novelty claim, and a post-hoc lesson. The proposer reads the frontier source plus this history, so each round builds on the last.
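A row in such a log might look like the sketch below. The field names are hypothetical (the slide lists the contents, not the schema), every value is a placeholder rather than a result, and an in-memory buffer stands in for the file on disk.

```python
import io
import json


def append_row(log, row):
    """Append one program's record as a single JSON line."""
    log.write(json.dumps(row, sort_keys=True) + "\n")


def read_rows(log):
    """Read the whole history back, one dict per line."""
    log.seek(0)
    return [json.loads(line) for line in log if line.strip()]


log = io.StringIO()  # stand-in for the JSONL file
append_row(log, {
    "program": "toy_program_001",                           # placeholder name
    "parent": None,                                         # no parent for a seed
    "delta_ndcg_at_10": {"task_a": 0.01, "task_b": -0.02},  # placeholder numbers
    "win_tie_loss": [1, 0, 1],
    "cost_ratio": 1.0,
    "novelty_claim": "placeholder",
    "lesson": "placeholder",
})
history = read_rows(log)
```

An append-only line format keeps each generation's write cheap and lets the proposer read the full lineage in one pass.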

Caveat

Compounding cuts both ways: memory builds on real wins, but it also compounds whatever bias the objective has. A biased metric does not just mislead one program, it steers the whole lineage.

Setup

What we search on, and what we test on.

Discovery · the loop searches here
j-v5-nano
239M · Jina
the discovery encoder
small + fast → quick loop cycles
searched over 14 MMTEB Tier-1 tasks · legal, financial, long-doc, general
one program, must generalize
Held-out · the loop never sees these
j-v5-small
568M · Jina
same family
gemma-300m
303M · Gemma 3
unseen family
qwen3-0.6b
600M · Qwen3
unseen family
unseen families → the real transfer test
tested on 19 MMTEB Tier-2/3 tasks · none in discovery (summarization, QA, fact-check, ...)

Metric: ΔnDCG@10, the program's nDCG@10 minus the cosine baseline. Gemma and Qwen share no training data, tokenizer, or adapter with the discovery model; qwen3-0.6b runs under a token-length cap.

The distinction

Cost is one number: extra forward passes through the encoder.

\( c \;=\; 1 \,+\, \dfrac{\text{extra forward passes}}{\text{baseline passes}} \)
Test-time compute  ·  c = 1
SoftCentroid: average the top-k document vectors you already computed, mix into the query, re-score.
q′ = q + mean(d_emb[top-k])
Zero extra forward passes - only algebra on vectors you already have.
Test-time compute  ·  c > 1
FirstSent: take the top doc, re-embed its first sentence, mix into the query, re-score.
q′ = q + embed_fn(first sentence)
One extra forward pass per query - new text through the encoder.

One reuses the geometry you already have; the other spends a forward pass on new text. Which converts into quality that transfers?
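The zero-pass case can be sketched directly from the slide's formula. Assumptions: `soft_centroid` and `cost_ratio` are illustrative names, the centroid is taken over the raw (unnormalized) document vectors exactly as the formula reads, and the vectors are toy data.

```python
import numpy as np


def cost_ratio(extra_passes: float, baseline_passes: float) -> float:
    """c = 1 + extra forward passes / baseline passes."""
    return 1.0 + extra_passes / baseline_passes


def soft_centroid(q: np.ndarray, d_emb: np.ndarray, k: int) -> np.ndarray:
    """q' = q + mean(d_emb[top-k]), then re-score every document by cosine."""
    dn = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    first = dn @ (q / np.linalg.norm(q))  # baseline cosine scores
    top = np.argsort(-first)[:k]          # indices of the top-k documents
    q2 = q + d_emb[top].mean(axis=0)      # algebra only, no new forward pass
    return dn @ (q2 / np.linalg.norm(q2))


q = np.array([1.0, 0.0])
d_emb = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
rescored = soft_centroid(q, d_emb, k=2)

c_soft_centroid = cost_ratio(extra_passes=0, baseline_passes=1)
```

FirstSent would differ in one line: the vector added to `q` comes from a fresh encoder call on new text, which is what pushes `c` above 1.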

Two rubrics, same loop

We run the search under two rubrics.

Compute rubric
Admit a program if its in-domain performance beats every prior one. The proposer is encouraged to spend more inference.
→ how high can the in-domain frontier climb?
Transfer rubric
Admit only if a validation split improves with no task regressing past 0.05. No reward for spending compute.
→ what holds up out of domain?

Neither search ever sees the 19 held-out evaluation tasks or the unseen encoder families.
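The two admission rules can be written as predicates. This is one possible reading, not the talk's implementation: "improves" is assumed to mean the mean validation delta exceeds the incumbent's, and "regressing past 0.05" is assumed to mean any per-task delta below −0.05. The function names and numbers are hypothetical.

```python
from statistics import mean
from typing import Mapping, Sequence


def admit_compute(candidate_mean: float, prior_means: Sequence[float]) -> bool:
    """Compute rubric: in-domain mean must beat every prior program."""
    return all(candidate_mean > p for p in prior_means)


def admit_transfer(
    val_delta: Mapping[str, float],
    incumbent_mean: float,
    max_regression: float = 0.05,
) -> bool:
    """Transfer rubric: validation improves and no task regresses past the cap."""
    no_big_loss = all(d >= -max_regression for d in val_delta.values())
    return no_big_loss and mean(val_delta.values()) > incumbent_mean


# Toy candidates (hypothetical numbers).
ok_compute = admit_compute(0.10, [0.07, 0.09])
rejected = admit_transfer({"a": 0.20, "b": -0.06}, incumbent_mean=0.0)
accepted = admit_transfer({"a": 0.02, "b": -0.01}, incumbent_mean=0.0)
```

Note that the rejected candidate has the higher mean: the transfer rule trades peak gain for a bounded worst case.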

What compute buys, in-domain

Told to spend compute, the search draws a clean Pareto curve.

144
programs searched over 144 generations
12
Pareto-optimal, from \(c = 1.2\) to \(14.7\)

In-domain mean ΔnDCG@10 climbs from +0.07 to +0.24. This looks exactly like test-time-compute scaling.

In-domain only. The held-out test comes next.
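Extracting a Pareto frontier from scored programs takes one sorted pass. This sketch uses made-up (cost, gain) pairs, not the talk's results, and `pareto_front` is an illustrative name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (cost ratio c, mean delta nDCG@10)


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep points that no cheaper-or-equal point matches or beats on gain."""
    front: List[Point] = []
    best_gain = float("-inf")
    # Sort by cost ascending; at equal cost, visit the higher gain first.
    for cost, gain in sorted(points, key=lambda p: (p[0], -p[1])):
        if gain > best_gain:
            front.append((cost, gain))
            best_gain = gain
    return front


# Toy programs (hypothetical numbers).
toy = [(1, 0.10), (2, 0.05), (2, 0.30), (3, 0.30), (5, 0.40)]
front = pareto_front(toy)
```

Programs off the frontier either cost more for the same gain or gain less for the same cost.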

What the search wrote

Twelve programs on the compute frontier.

BidirZScore c=1.2
SentMaxSim c=2.2
AdaptGranularity c=2.7
CoverageTriple c=3.7
LexicalHybridRRF c=3.9
CrossRoundRRF c=3.9
DiverseDualCtx c=5.6
ConsensusRocchio c=6.4
NegContrastive c=7.2
MomentumProg c=9.8
GraphCentrality c=12.2
FisherStability c=14.7

Every one is a training-free recombination of the same frozen vectors. Cost climbs from the first program to the last; the improvements, as the next slide shows, do not.

What it buys on held-out data

On unseen encoders, compute is flat. Cheap structure is not.

[Figure: held-out median ΔnDCG@10 per program; series: compute rubric, transfer rubric]
14.7×
the most compute a program spends, beaten flat by a zero-pass program at \(c = 1.0\)
−0.016
compute's pooled held-out mean, below baseline
−0.98
its worst per-query cell

The transfer-search program at \(c = 1.0\), zero extra forward passes, already beats the most expensive compute program at \(c = 14.7\). Median is plotted, robust to the \(-0.98\) tail; the pooled mean is negative too.

The compute frontier, cell by cell

Compute helps about half the cells, and collapses the rest.

[Heatmap: rows are programs ordered by compute cost, 1.2× (cheap) to 14.7× (costly); columns are the 19 held-out tasks; color scale from −0.98 to +0.16]

Each row is one of twelve programs ordered by compute cost (1.2× to 14.7×); each column is one of nineteen held-out tasks; four encoders, three never seen in discovery. About half the cells are positive (485 of 912), and 52 of 76 task-encoder pairs improve under at least one program - yet the deep-pink cells fall to −0.98, so the pooled mean stays negative at −0.016.
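A short arithmetic check shows how most cells can be positive while the pooled mean is negative. The six values below are made up for illustration; only the −0.98 floor echoes the slide.

```python
from statistics import mean, median

# Toy cells (hypothetical): five small gains and one collapse.
cells = [0.05, 0.04, 0.03, 0.02, 0.01, -0.98]

positive_share = sum(c > 0 for c in cells) / len(cells)
pooled_mean = mean(cells)   # dragged down by the single -0.98 cell
robust_mid = median(cells)  # barely moved by the tail
```

One deep collapse outweighs many small gains in the mean, which is why the median and the worst cell are reported alongside it.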

The transfer rubric

A different rubric finds six cheap programs that transfer.

Program | \(c\) | Median | Win-rate | Mean | Worst cell
It transfers

Improvements are largest on encoder families never seen in discovery.

[Figure: mean and median ΔnDCG@10 per encoder]

Discovered on j-v5-nano. Mean ΔnDCG@10 is positive on all four encoders and largest on the two families never seen during discovery; the medians sit near zero on the jina-family encoders, so the improvement is concentrated in a positive tail, not broad. The transfer follows general embedding geometry, not artifacts of the discovery encoder.

Across languages, too

Applied unmodified to French and Greek: median +0.016, an 86% win-rate, and every held-out cell positive on gemma-300m.

Why not just train a head?

A matched-budget learned head memorizes. Structure transfers.

[Figure: in-domain and held-out ΔnDCG@10 for learned heads, with the structure result as reference]

Trained on the same 14 discovery tasks, a linear, low-rank, or MLP head improves in-domain retrieval by +0.20 to +0.25 ΔnDCG@10, yet falls below baseline on every held-out encoder.

Adding parameters at the same data budget does not transfer. Recombining the frozen geometry does.

What the search found

The recurring structure is classical IR, re-derived in embedding space.

Reciprocal Rank Fusion
rediscovered, not seeded
fuse two rankings into one consensus
Fisher Linear Discriminant
rediscovered, not seeded
the axis that best separates relevant from not
Rocchio Pseudo-Relevance Feedback
operationalized from a seeded idea
pull the query toward the relevant centroid
Sentence-level MaxSim
operationalized from a seeded idea
score the best sentence, not the mean

Not new programs - four classical methods the search keeps re-deriving across both frontiers: two rediscovered (RRF, Fisher), two operationalized from seeds (Rocchio, MaxSim). The structure is geometric (z-scoring, sub-document granularity, centroid feedback) and depends on cosine geometry, not model-specific training, so the cheap forms carry to encoder families never seen during discovery.
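Two of these building blocks are small enough to show in full. The sketch assumes the standard RRF form, a sum of 1/(k + rank) with the commonly used k = 60, and per-query z-scoring across documents; neither is taken from the discovered programs themselves, and the inputs are toy data.

```python
import numpy as np


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def zscore_rows(sim):
    """Standardize each query's scores across documents (rows are queries)."""
    mu = sim.mean(axis=1, keepdims=True)
    sd = sim.std(axis=1, keepdims=True)
    return (sim - mu) / np.where(sd > 0, sd, 1.0)


fused = rrf([["a", "b", "c"], ["b", "c", "a"]])
z = zscore_rows(np.array([[0.1, 0.2, 0.3], [0.9, 0.5, 0.1]]))
```

RRF needs only ranks, so it can fuse channels whose raw scores are not comparable; z-scoring makes raw scores comparable when they must be added directly.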

The trend

Test-time compute is the pattern for deep research and long-horizon tasks.

2025 · Deep research
Query → loop until budget (Search → Read → Reason) → Answer
one web-bound loop · all Web IO
2026 · Long-horizon tasks
Query → research (Search → Read → Reason) → dataroom → agent loop (Read → Run → Write) → Output
phase 1 research · Web IO  →  build the corpus · phase 2 runs offline · NO Web IO

Research and execution want different tokens: build the dataroom with cheap local ones, and save the costly frontier budget for the execution that actually needs it. That two-tier split is dataroom followed by searchbox, covered next.

Version B · the corpus stage
github.com/hanxiao/dataroom

dataroom: a loop for knowledge dump.

Given a token budget, spend it on local small models instead of a frontier model: run search → read → write, repeat, until the knowledge is dumped into one cited .zip. That dump is the dataroom - the open web distilled down to a small, local corpus a machine can consume.

dataroom
build the corpus (.zip)
searchbox
answer from the .zip

Stage one of two: the grounded .zip then goes to searchbox (next) or a frontier model for the expensive second stage.

[Screenshot: dataroom job dashboard]
Live job page: progress to the coverage floor, token usage, and the tool-call distribution. Outcome-based stopping, not a token budget.
Version B · the search stage
github.com/hanxiao/searchbox

searchbox: a testbed for studying agentic search loops.

An airgapped testbed for search as test-time compute: lock an agent in the box with one .zip dataroom and no web, so it can only answer by composing its own pipeline from local tools - grep, embed, rerank, similarity, cluster, select_diverse. Nothing leaks in; the search has to exhaust what is in the box.

Open research questions

Which tool does the agent reach for first?

Is grep all you need: where does a dense retriever add nothing?

Does forcing more token budget (scaling TTC) help on the hard questions?

[Illustration: a dataroom.zip handed into an airgapped qwen3.6 loop that answers]
Version B · the verifier
github.com/hanxiao/knowledge-graph-extractor

knowledge-graph: hard multi-hop questions for a private verifier.

Trivial questions are useless for agentic search: if one grep finds the answer, every method scores the same. To evaluate searchbox you need questions that force the search to actually work.

How

Turn the corpus into a knowledge graph - each fact a (subject)-[predicate]->(object) edge - then walk its longest paths. Those chains become hard multi-hop questions no single passage answers: a private, corpus-grounded eval, grown from the same corpus searchbox is locked inside.
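Walking the longest path can be sketched as below. This assumes the fact graph is acyclic, which real extracted graphs may not be (longest simple path in a general graph is NP-hard); `longest_chain` and the toy facts are hypothetical and not taken from the knowledge-graph-extractor repository.

```python
from functools import lru_cache
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def longest_chain(triples: Sequence[Triple]) -> List[Triple]:
    """Return the longest chain of edges in an acyclic fact graph."""
    out: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in triples:
        out.setdefault(s, []).append((p, o))

    @lru_cache(maxsize=None)
    def best_from(node: str) -> Tuple[Triple, ...]:
        best: Tuple[Triple, ...] = ()
        for p, o in out.get(node, []):
            chain = ((node, p, o),) + best_from(o)  # extend through this edge
            if len(chain) > len(best):
                best = chain
        return best

    return list(max((best_from(n) for n in list(out)), key=len, default=()))


# Toy facts (hypothetical).
facts = [
    ("Ada", "founded", "Acme"),
    ("Acme", "acquired", "Bolt"),
    ("Bolt", "based_in", "Oslo"),
    ("Ada", "born_in", "Paris"),
]
chain = longest_chain(facts)
```

The three-edge chain from Ada to Oslo is the kind of path that turns into a multi-hop question, since no single fact connects its endpoints.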

[Screenshot: knowledge-graph UI, corpus facts extracted into a force-directed graph with a longest-path view]
Live graph: every fact an edge; the longest-path view surfaces the multi-hop questions.
Connecting the dots

Both manufacture a search pipeline at test time. Neither grows the model.

A   Embedding algebra
test-time: multi-pass embedding algebra
what scales: structure, not forward passes
B   Agentic search pipeline
test-time: a chain of retrieval tools
what scales: tool composition, not parameters
Thank you

Search is test-time compute
autoresearch scales it

Han Xiao  ·  VP of AI, Elastic
github.com/hanxiao  ·  arXiv:2605.11374
@hxiao  ·  in/hxiao87
follow on X: x.com/hxiao
LinkedIn: linkedin.com/in/hxiao87
these slides: hanxiao.io/aie-sf-2026
Elastic hackathon this evening: register on Luma at luma.com/aws-elastic-hacknight