Embedding geometry · Loop · Agentic search

Autoresearch &
test-time compute for
information retrieval

Han Xiao

VP of AI, Elastic

@hxiao · in/hxiao87

scan for the slides

hanxiao.io/aie-sf-2026

Test-time compute

Spend more compute at inference, get a better answer.

Trade inference compute for quality: rather than a bigger model, the model does more work per query, along a smooth cost-versus-quality curve.

Best-of-N

sample many candidates, keep the best.

Self-consistency

majority-vote over sampled chains of thought.

Verifier search

spend inference FLOPs, climb a Pareto curve.
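The first two strategies above can be sketched in a few lines. This is a minimal illustration, not code from the talk: `sample` and `score` are hypothetical callables standing in for a model's sampler and a verifier.

```python
from collections import Counter
from typing import Callable


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Best-of-N: draw n candidates and keep the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Self-consistency: draw n final answers and return the most common one."""
    votes = Counter(sample() for _ in range(n))
    return votes.most_common(1)[0][0]
```

In both cases `n` is the compute knob: a larger `n` costs more inference and, when the scorer or the vote is informative, tends to return a better answer.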

"Having a bot think for just 20 seconds in a hand of poker got the same performance boost as scaling up the model by 100,000× and training it for 100,000 times longer."

Noam Brown, OpenAI, 2024

The reframe

Search is test-time compute.

You wire trained embeddings, rerankers, multi-vector retrievers, and query expanders into a pipeline at inference to squeeze out relevance. You do not need a bigger model. You assemble more search at test time.

Don't scale it

one keyword match. Cheap, and probably not good enough.

Scale it

embed, retrieve, rerank, expand, fuse. More inference can buy more relevance.

The question is not whether the model is big enough. It is how much search pipeline you assemble at test time, and whether it pays off.

Two versions

Two ways to manufacture that pipeline at test time.

A Recombine the geometry

An agentic loop writes programs over one frozen encoder: chunk, z-score, fuse channels, feed back. The pipeline is multi-pass embedding algebra.

this talk's core

B Compose the tools

A small agent LLM wires retrieval tools (grep, embed, rerank, select_diverse) over a corpus under a budget. The pipeline is an agentic tool chain.

searchbox · later

Version A

"Small models can't improve." But they are distilled from LLMs.

Test-time compute is said to belong to large reasoning models, with small models unable to improve. Yet today's embedders - including the small deployable ones - are distilled or adapted from LLM backbones:

Mistral-7B

→

E5-Mistral · SFR-Embedding · GritLM · NV-Embed · bge-en-icl

Qwen3

→

qwen3-embed · jina-embeddings-v5

Gemma 3

→

EmbeddingGemma

LLaMA

→

RepLLaMA · LLM2Vec

The lineage from a large language model to a small deployable encoder is now the dominant recipe, not the exception.

If test-time compute lives in the LLM representation space, these distilled encoders should inherit it. Do they?

The intuition

Scoring runs from one cosine to late interaction.

single-vector cosine

\( \cos(q,\,d) \)

one vector per document

frozen baseline

sentence MaxSim

\( \max_{s}\ \cos(q,\,s) \)

same encoder, document split into sentence vectors

test-time compute

ColBERT multi-vector

\( \sum_i \max_j\ \cos(q_i,\,d_j) \)

every query token, max over all doc tokens

needs a multi-vector model
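The three scoring rules above, written out with NumPy. A minimal sketch: the function names are illustrative, and inputs are assumed to be non-zero float vectors (1-D for a single vector, one row per sentence or token otherwise).

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so dot products are cosines."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cosine_score(q: np.ndarray, d: np.ndarray) -> float:
    """Single-vector cosine: one vector for the query, one for the document."""
    return float(normalize(q) @ normalize(d))


def sentence_maxsim(q: np.ndarray, sents: np.ndarray) -> float:
    """Sentence MaxSim: best cosine between the query and any sentence vector."""
    return float((normalize(sents) @ normalize(q)).max())


def colbert_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Late interaction: for each query token, max cosine over doc tokens, then sum."""
    sim = normalize(q_tokens) @ normalize(d_tokens).T  # shape (n_query, n_doc)
    return float(sim.max(axis=1).sum())
```

The first two need only a single-vector encoder; the third assumes per-token vectors, which is why it needs a multi-vector model.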

The question

How much can a frozen single-vector encoder improve at inference alone?

No retrainingNo auxiliary modelNo learned parametersOne frozen encoder API

The prominent recent test-time methods for retrieval each break at least one rule:

HyDE / Query2Doc

an external LLM in the query path.

GQR

a second, supervisory retriever.

MetaEmbed

trained extra parameters at deployment.

We forbid all three, and ask whether the improvement scales with the compute spent.

Autoresearch

Autoresearch: let the agent climb the hill.

An agent runs the research loop itself: change one file, run a short fixed-budget experiment, keep the change if the metric improved, otherwise revert. Repeat, overnight. It is hill-climbing, with an LLM as the mutation function.

change a file→ run a short experiment→ keep if better, else revert↻
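The keep-or-revert loop is ordinary hill-climbing. A minimal sketch, assuming hypothetical `mutate` and `evaluate` callables; in the talk's setting the mutation is an LLM edit and the evaluation is a short fixed-budget experiment.

```python
from typing import Callable, Tuple, TypeVar

P = TypeVar("P")


def hill_climb(
    start: P,
    mutate: Callable[[P], P],
    evaluate: Callable[[P], float],
    generations: int,
) -> Tuple[P, float]:
    """Keep a mutation only if the metric strictly improves; otherwise revert."""
    best, best_score = start, evaluate(start)
    for _ in range(generations):
        candidate = mutate(best)      # propose a change to the current best
        score = evaluate(candidate)   # run the experiment
        if score > best_score:        # keep if better
            best, best_score = candidate, score
        # else revert: `best` is simply left unchanged
    return best, best_score
```

For example, `hill_climb(0, lambda x: x + 1, lambda x: -(x - 3) ** 2, 10)` climbs to `x = 3` and then rejects every further step.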

"You're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md files that set up your autonomous research org."

Andrej Karpathy, Anthropic, 2026

Method

An LLM agent writes programs over the frozen encoder, and an evaluator scores them.

Proposer

LLM agent, Opus 4.6

→

Program P

Python over j-v5-nano

→

Evaluator

ΔnDCG, 14 tasks

→

Memory

JSONL file

→

Registry

144 programs

the loop: memory conditions the next program

writes: 1 program per generation

→

the encoder: 239M, frozen, small + fast

→

scores on: 14 MMTEB tasks, ΔnDCG@10

→

logs: Δ, cost, parent, lesson per program

→

after G = 144: 144 programs scored

Method · 1 Proposer

The proposer is an LLM, used as the mutation function.

Design

Opus 4.6 reads the current frontier source and the memory file, edits one Python program, and proposes the next. Keep it if the metric improves, else revert. No human in the inner loop.

Caveat

It optimizes exactly the metric you hand it - not the one you mean. Reward in-domain performance and reward spending compute, and that is what it will chase. Whether the improvement survives out of domain is a separate question. The objective is the steering wheel.

Method · 2 Program

The program is arbitrary Python over the frozen encoder.

Design

rerank(q_emb, d_emb, sim, *,
embed_fn, q_texts, d_texts) -> scores

embed_fn is the test-time-compute budget. It re-embeds any text, switches the LoRA adapter (retrieval / passage / classification / matching), picks a Matryoshka dimension, or caps length. One call is one unit of compute.

Caveat

Auto-rejected: hyperparameters, stochastic ops, task-specific routing, external models, learned parameters, trivial constant-only variants. These constraints force task-agnostic structure, not a per-task tuned config.
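To make the `rerank` signature concrete, here is one program that fits it. This is a sketch under stated assumptions, not one of the 144 programs: array shapes (`q_emb` as queries × dim, `d_emb` as docs × dim, `sim` as queries × docs), the `toy_embed_fn` stand-in encoder, and the fixed `TOP_K` are all illustrative.

```python
import numpy as np
from typing import Sequence

TOP_K = 2  # fixed for illustration only


def toy_embed_fn(texts: Sequence[str], dim: int = 64) -> np.ndarray:
    """Stand-in encoder (assumption): hashed bag-of-words, L2-normalized."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            out[i, sum(map(ord, tok)) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)


def rerank(q_emb, d_emb, sim, *, embed_fn, q_texts, d_texts):
    """For each query, re-embed the sentences of its top docs and score each
    of those docs by max(document cosine, best sentence cosine)."""
    scores = sim.copy()
    for qi in range(sim.shape[0]):
        for di in np.argsort(-sim[qi])[:TOP_K]:
            sents = [s.strip() for s in d_texts[di].split(".") if s.strip()]
            if not sents:
                continue
            s_emb = embed_fn(sents)  # the extra compute: new text through the encoder
            scores[qi, di] = max(sim[qi, di], float((s_emb @ q_emb[qi]).max()))
    return scores


q_texts = ["frozen encoder"]
d_texts = ["Weather is nice today. We use a frozen encoder.", "Cats sleep a lot."]
q_emb, d_emb = toy_embed_fn(q_texts), toy_embed_fn(d_texts)
sim = q_emb @ d_emb.T
scores = rerank(q_emb, d_emb, sim, embed_fn=toy_embed_fn, q_texts=q_texts, d_texts=d_texts)
```

Everything the program needs arrives through the arguments; the only way to spend compute is to call `embed_fn` on new text.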

Method · 3 Evaluator

The evaluator scores every program on the same 14 tasks.

Design

Every program runs on the same 14 MMTEB Tier-1 discovery tasks (legal, financial, long-document, general). The evaluator scores ΔnDCG@10 against the cosine baseline, plus a cost ratio (embed_fn calls per query). The same fixed budget every generation.

Caveat

The loop only ever sees these 14 tasks. The 19 held-out tasks never enter the loop, so a program can win in-domain and still fail out of domain. That gap is the entire experiment.
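The evaluator's metric, ΔnDCG@10, can be computed as below. A minimal sketch assuming linear gains and a log2 discount; the talk does not state which nDCG variant the benchmark uses, and the function names are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc, 0.0) for doc in ranking]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def delta_ndcg(program_ranking: List[str], baseline_ranking: List[str],
               relevance: Dict[str, float], k: int = 10) -> float:
    """The program's nDCG@k minus the cosine baseline's nDCG@k."""
    return ndcg_at_k(program_ranking, relevance, k) - ndcg_at_k(baseline_ranking, relevance, k)
```

A positive value means the program ranked relevant documents higher than the plain cosine ordering did for that query.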

Method · 4 Memory

Every program is logged, and the log conditions the next one.

Design

A JSONL file, one row per program: per-task ΔnDCG@10, win/tie/loss, cost ratio, parent, a novelty claim, and a post-hoc lesson. The proposer reads the frontier source plus this history, so each round builds on the last.

Caveat

Compounding cuts both ways: memory builds on real wins, but it also compounds whatever bias the objective has. A biased metric does not just mislead one program, it steers the whole lineage.
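One memory row might look like the sketch below. The field names, the tie threshold, and every value are placeholders chosen for illustration; the talk lists what each row contains but not its schema.

```python
import io
import json
from typing import Dict, IO, List


def make_row(name: str, parent: str, per_task_delta: Dict[str, float],
             cost_ratio: float, novelty: str, lesson: str,
             tie_eps: float = 0.005) -> dict:
    """Build one log row; tie_eps is a hypothetical threshold for calling a tie."""
    wins = sum(d > tie_eps for d in per_task_delta.values())
    losses = sum(d < -tie_eps for d in per_task_delta.values())
    ties = len(per_task_delta) - wins - losses
    return {
        "program": name,
        "parent": parent,
        "per_task_delta_ndcg10": per_task_delta,
        "win_tie_loss": [wins, ties, losses],
        "cost_ratio": cost_ratio,
        "novelty": novelty,
        "lesson": lesson,
    }


def append_row(fp: IO[str], row: dict) -> None:
    """JSONL: one JSON object per line."""
    fp.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_history(fp: IO[str]) -> List[dict]:
    """Read every row back; this is what the proposer would condition on."""
    return [json.loads(line) for line in fp if line.strip()]


log = io.StringIO()  # stands in for the JSONL file on disk
append_row(log, make_row(
    "ExampleProgram", "cosine",
    {"task_a": 0.04, "task_b": -0.01, "task_c": 0.0},  # placeholder numbers
    cost_ratio=2.0, novelty="placeholder claim", lesson="placeholder lesson",
))
log.seek(0)
history = load_history(log)
```

Because each row records its parent, the file doubles as a lineage: a biased objective shows up as a whole branch of descendants, not a single bad row.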

Setup

What we search on, and what we test on.

Discovery · the loop searches here

j-v5-nano

239M · Jina

the discovery encoder

small + fast → quick loop cycles

searched over 14 MMTEB Tier-1 tasks · legal, financial, long-doc, general

→

one program,
must generalize

Held-out · the loop never sees these

j-v5-small

568M · Jina

same family

gemma-300m

303M · Gemma 3

unseen family

qwen3-0.6b

600M · Qwen3

unseen family

unseen families → the real transfer test

tested on 19 MMTEB Tier-2/3 tasks · none in discovery (summarization, QA, fact-check, ...)

Metric: ΔnDCG@10, the program's nDCG@10 minus the cosine baseline. Gemma and Qwen share no training data, tokenizer, or adapter with the discovery model; qwen3-0.6b runs under a token-length cap.

The distinction

Cost is one number: extra forward passes through the encoder.

\( c \;=\; 1 \,+\, \dfrac{\text{extra forward passes}}{\text{baseline passes}} \)

Test-time compute · c = 1

SoftCentroid: average the top-k document vectors you already computed, mix into the query, re-score.

q′ = q + mean(d_emb[top-k])

Zero extra forward passes - only algebra on vectors you already have.

Test-time compute · c > 1

FirstSent: take the top doc, re-embed its first sentence, mix into the query, re-score.

q′ = q + embed_fn(first sentence)

One extra forward pass per query - new text through the encoder.

One reuses the geometry you already have; the other spends a forward pass on new text. Which converts into quality that transfers?
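The two update rules side by side, with a counter on the encoder to make the cost difference visible. A sketch under assumptions: `toy_encode` is a stand-in for the frozen encoder, the updated query is re-normalized before re-scoring, and `k` and the sentence splitter are illustrative choices.

```python
import numpy as np


def unit(x):
    """L2-normalize along the last axis, guarding against zero vectors."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)


def toy_encode(texts, dim=32):
    """Stand-in encoder (assumption): hashed bag-of-words."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            out[i, sum(map(ord, tok)) % dim] += 1.0
    return unit(out)


class CountingEncoder:
    """Wraps an encoder and counts forward passes, one per text."""
    def __init__(self, encode):
        self.encode, self.calls = encode, 0

    def __call__(self, texts):
        self.calls += len(texts)
        return self.encode(texts)


def soft_centroid(q, d_emb, k=2):
    """q' = q + mean(d_emb[top-k]); re-score. Uses no encoder calls."""
    top = np.argsort(-(d_emb @ q))[:k]
    return d_emb @ unit(q + d_emb[top].mean(axis=0))


def first_sent(q, d_emb, d_texts, embed_fn):
    """q' = q + embed_fn(first sentence of the top doc); re-score."""
    top = int(np.argmax(d_emb @ q))
    first = d_texts[top].split(".")[0]
    return d_emb @ unit(q + embed_fn([first])[0])


def cost_ratio(extra_passes, baseline_passes):
    """c = 1 + extra forward passes / baseline passes."""
    return 1 + extra_passes / baseline_passes


enc = CountingEncoder(toy_encode)
d_texts = ["Rerankers reorder candidates. They are slow.",
           "Embeddings map text to vectors. Cosine compares them.",
           "Bananas are yellow."]
d_emb = toy_encode(d_texts)                      # baseline index, not counted
q = toy_encode(["cosine over embeddings"])[0]    # baseline query, not counted
s1 = soft_centroid(q, d_emb)
calls_after_soft = enc.calls
s2 = first_sent(q, d_emb, d_texts, enc)
calls_after_first = enc.calls
```

After the run, `calls_after_soft` is still zero while `calls_after_first` has gone up by one, which is the whole distinction between the two columns.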

Two rubrics, same loop

We run the search under two rubrics.

Compute rubric

Admit a program if its in-domain performance beats every prior one. The proposer is encouraged to spend more inference.

→ how high can the in-domain frontier climb?

Transfer rubric

Admit only if a validation split improves with no task regressing past 0.05. No reward for spending compute.

→ what holds up out of domain?

Neither search ever sees the 19 held-out evaluation tasks or the unseen encoder families.
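The two admission rules could be written as predicates like these. This is one reading of the slide, not the talk's code: it assumes "improves" means the mean validation score rises against the incumbent, and "regressing past 0.05" means no single task drops by more than 0.05.

```python
from typing import Dict, List


def admit_compute(in_domain_mean: float, prior_means: List[float]) -> bool:
    """Compute rubric: admit only if the program beats every prior in-domain mean."""
    return all(in_domain_mean > m for m in prior_means)


def admit_transfer(candidate: Dict[str, float], incumbent: Dict[str, float],
                   max_regression: float = 0.05) -> bool:
    """Transfer rubric (assumed reading): mean validation gain is positive
    and no task falls by more than max_regression."""
    gains = [candidate[t] - incumbent[t] for t in incumbent]
    return sum(gains) / len(gains) > 0 and min(gains) >= -max_regression
```

The second predicate has no term for cost at all, so spending more forward passes earns nothing unless validation improves without a large per-task drop.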

What compute buys, in-domain

Told to spend compute, the search draws a clean Pareto curve.

144

programs searched over 144 generations

12

Pareto-optimal, from \(c = 1.2\) to \(14.7\)

In-domain mean ΔnDCG@10 climbs from +0.07 to +0.24. This looks exactly like test-time-compute scaling.

In-domain only. The held-out test comes next.
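Picking the Pareto-optimal programs out of a registry is a short sweep over (cost, quality) pairs. A minimal sketch with hypothetical program names and placeholder numbers; a program is kept if no cheaper-or-equal program matches or beats its quality.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, cost c, in-domain quality)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Sort by cost (quality descending on ties) and keep each point that
    strictly improves on the best quality seen so far."""
    frontier: List[Point] = []
    best_quality = float("-inf")
    for name, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:
            frontier.append((name, cost, quality))
            best_quality = quality
    return frontier


registry = [("p1", 1.0, 0.10), ("p2", 2.0, 0.05), ("p3", 3.0, 0.20), ("p4", 3.0, 0.15)]
frontier = pareto_frontier(registry)
```

Here `p2` is dropped because `p1` is both cheaper and better, and `p4` is dropped because `p3` is better at the same cost.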

What the search wrote

Twelve programs on the compute frontier.

BidirZScore c=1.2

SentMaxSim c=2.2

AdaptGranularity c=2.7

CoverageTriple c=3.7

LexicalHybridRRF c=3.9

CrossRoundRRF c=3.9

DiverseDualCtx c=5.6

ConsensusRocchio c=6.4

NegContrastive c=7.2

MomentumProg c=9.8

GraphCentrality c=12.2

FisherStability c=14.7

Every one is a training-free recombination of the same frozen vectors. Cost climbs left to right; the improvements, as the next slide shows, do not.

What it buys on held-out data

On unseen encoders, compute is flat. Cheap structure is not.

[Plot legend: compute rubric · transfer rubric]

14.7×

the most compute a program spends, beaten flat by a zero-pass program at \(c = 1.0\)

−0.016

compute's pooled held-out mean, below baseline

−0.98

its worst per-query cell

The transfer-search program at \(c = 1.0\), zero extra forward passes, already beats the most expensive compute program at \(c = 14.7\). Median is plotted, robust to the \(-0.98\) tail; the pooled mean is negative too.

The compute frontier, cell by cell

Compute helps about half the cells, and collapses the rest.

[Heatmap: rows are programs ordered by compute cost, from 1.2× (cheap) to 14.7× (costly); columns are the 19 held-out tasks; color scale runs from −0.98 to +0.16.]

Each row is one of twelve programs ordered by compute cost (1.2× to 14.7×); each column is one of nineteen held-out tasks; four encoders, three never seen in discovery. About half the cells are positive (485 of 912), and 52 of 76 task-encoder pairs improve under at least one program - yet the deep-pink cells fall to −0.98, so the pooled mean stays negative at −0.016.
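The counts in the caption follow from the grid dimensions, and a toy list shows how a few collapsed cells can pull a mean below zero while the median stays positive. The counts come from the caption; the `toy` values are made up for illustration and are not the talk's data.

```python
import statistics

programs, tasks, encoders = 12, 19, 4
cells = programs * tasks * encoders   # 912 program-task-encoder cells
pairs = tasks * encoders              # 76 task-encoder pairs

positive_share = 485 / cells          # about 0.53, "about half the cells"
improved_pair_share = 52 / pairs      # about 0.68

# Made-up example: nine small gains and one collapse to -0.98.
toy = [0.02] * 9 + [-0.98]
toy_median = statistics.median(toy)   # unaffected by the single outlier
toy_mean = statistics.mean(toy)       # dragged negative by it
```

This is why the deck reports the median alongside the pooled mean: the two answer different questions when the distribution has a heavy negative tail.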

The transfer rubric

A different rubric finds six cheap programs that transfer.

Program	\(c\)	Median	Win-rate	Mean	Worst cell

It transfers

Improvements are largest on encoder families never seen in discovery.

[Plot legend: mean · median]

Discovered on j-v5-nano. Mean ΔnDCG@10 is positive on all four encoders and largest on the two families never seen during discovery; the medians sit near zero on the jina-family encoders, so the improvement is concentrated in a positive tail, not broad. The transfer follows general embedding geometry, not artifacts of the discovery encoder.

Across languages, too

Applied unmodified to French and Greek: median +0.016, an 86% win-rate, and every held-out cell positive on gemma-300m.

Why not just train a head?

A matched-budget learned head memorizes. Structure transfers.

[Plot legend: in-domain · held-out · structure (ref)]

Trained on the same 14 discovery tasks, a linear, low-rank, or MLP head improves in-domain retrieval by +0.20 to +0.25 ΔnDCG@10, yet falls below baseline on every held-out encoder.

Adding parameters at the same data budget does not transfer. Recombining the frozen geometry does.

What the search found

The recurring structure is classical IR, re-derived in embedding space.

Reciprocal Rank Fusion

rediscovered, not seeded

fuse two rankings into one consensus

Fisher Linear Discriminant

rediscovered, not seeded

the axis that best separates relevant from not

Rocchio Pseudo-Relevance Feedback

operationalized from a seeded idea

pull the query toward the relevant centroid

Sentence-level MaxSim

operationalized from a seeded idea

score the best sentence, not the mean

Not new programs - four classical methods the search keeps re-deriving across both frontiers: two rediscovered (RRF, Fisher), two operationalized from seeds (Rocchio, MaxSim). The structure is geometric (z-scoring, sub-document granularity, centroid feedback) and depends on cosine geometry, not model-specific training, so the cheap forms carry to encoder families never seen during discovery.
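Three of the recurring pieces fit in a few lines each. These are textbook forms given as a sketch, not the discovered programs: `k = 60` is the constant commonly used with Reciprocal Rank Fusion, and `alpha`, `beta`, and `top_k` in the Rocchio update are illustrative defaults.

```python
import numpy as np
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: each ranking adds 1 / (k + rank) to a doc's score."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


def zscore(scores: np.ndarray) -> np.ndarray:
    """Standardize scores so different channels are comparable before fusing."""
    std = scores.std()
    return (scores - scores.mean()) / std if std > 0 else np.zeros_like(scores)


def rocchio(q: np.ndarray, d_emb: np.ndarray, top_k: int = 3,
            alpha: float = 1.0, beta: float = 0.5) -> np.ndarray:
    """Pseudo-relevance feedback: pull q toward the centroid of its top_k docs."""
    top = np.argsort(-(d_emb @ q))[:top_k]
    q_new = alpha * q + beta * d_emb[top].mean(axis=0)
    return q_new / np.linalg.norm(q_new)
```

None of the three touches model weights; they operate only on rankings, scores, or vectors that already exist.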

The trend

Test-time compute is the pattern for deep research and long-horizon tasks.

2025 · Deep research

Query → [loop until budget: Search → Read → Reason] → Answer

one web-bound loop · all Web IO

2026 · Long-horizon tasks

Query → [research: Search → Read → Reason] → dataroom → [agent loop: Read → Run → Write] → Output

phase 1 research · Web IO → build the corpus · phase 2 runs offline · NO Web IO

Research and execution want different tokens: build the dataroom with cheap local ones, and save the costly frontier budget for the execution that actually needs it. That two-tier split is dataroom first, then searchbox, both covered next.

Version B · the corpus stage

dataroom: a loop for knowledge dump.

Given a token budget, spend it on local small models instead of a frontier model: run search → read → write, repeat, until the knowledge is dumped into one cited .zip. That dump is the dataroom - the open web distilled down to a small, local corpus a machine can consume.

dataroom

build the corpus (.zip)

↓

searchbox

answer from the .zip

Stage one of two: the grounded .zip then goes to searchbox (next) or a frontier model for the expensive second stage.

Live job page: progress to the coverage floor, token usage, and the tool-call distribution. Outcome-based stopping, not a token budget.

Version B · the search stage

searchbox: a testbed for studying agentic search loops.

An airgapped testbed for search as test-time compute: lock an agent in the box with one .zip dataroom and no web, so it can only answer by composing its own pipeline from local tools - grep, embed, rerank, similarity, cluster, select_diverse. Nothing leaks in; the search has to exhaust what is in the box.
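As an illustration of what one of these local tools might do, here is a guess at `select_diverse` as greedy farthest-point selection over embeddings. The tool name comes from the slide; the algorithm, signature, and starting point are assumptions, not the searchbox implementation.

```python
import numpy as np
from typing import List


def select_diverse(emb: np.ndarray, k: int) -> List[int]:
    """Greedily pick up to k rows, each time taking the row least similar
    (by cosine) to anything already chosen. Starts from row 0."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    chosen = [0]
    max_sim = emb @ emb[0]            # similarity of each row to its nearest chosen row
    while len(chosen) < min(k, len(emb)):
        max_sim[chosen] = np.inf      # never re-pick a chosen row
        nxt = int(np.argmin(max_sim))
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, emb @ emb[nxt])
    return chosen
```

An agent with a fixed reading budget could call a tool like this to avoid spending it on near-duplicate passages.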

Open research questions

Which tool does the agent reach for first?

Is grep all you need: where does a dense retriever add nothing?

Does forcing more token budget (scaling TTC) help on the hard questions?

searchbox: a dataroom.zip handed into an airgapped qwen3.6 loop that answers

the airgapped loop, illustrated

Version B · the verifier

GitHub repo: github.com/hanxiao/knowledge-graph-extractor

knowledge-graph: hard multi-hop questions for a private verifier.

Trivial questions are useless for agentic search: if one grep finds the answer, every method scores the same. To evaluate searchbox you need questions that force the search to actually work.

How

Turn the corpus into a knowledge graph - each fact a (subject)-[predicate]->(object) edge - then walk its longest paths. Those chains become hard multi-hop questions no single passage answers: a private, corpus-grounded eval, grown from the same corpus searchbox is locked inside.
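The path-walking idea can be sketched as a longest-path search over fact triples. This is an illustration only: it assumes the fact graph is acyclic, the example triples are invented, and the question template is a placeholder rather than the extractor's actual wording.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def longest_path(triples: List[Triple]) -> List[Triple]:
    """Longest chain of edges in an acyclic fact graph (memoized DFS)."""
    out: Dict[str, List[Triple]] = {}
    for s, p, o in triples:
        out.setdefault(s, []).append((s, p, o))

    @lru_cache(maxsize=None)
    def best_from(node: str) -> Tuple[Triple, ...]:
        best: Tuple[Triple, ...] = ()
        for edge in out.get(node, []):
            path = (edge,) + best_from(edge[2])
            if len(path) > len(best):
                best = path
        return best

    starts = sorted({s for s, _, _ in triples})
    return list(max((best_from(s) for s in starts), key=len, default=()))


def to_question(path: List[Triple]) -> Tuple[str, str]:
    """Nest the predicates into one multi-hop question; the answer is the final object."""
    phrase = path[0][0]
    for _, pred, _ in path:
        phrase = f"the {pred} of {phrase}"
    return f"What is {phrase}?", path[-1][2]


facts = [
    ("Acme", "ceo", "Dana"),
    ("Dana", "alma_mater", "Uni X"),
    ("Uni X", "city", "Springfield"),
    ("Acme", "sector", "tools"),
]
path = longest_path(facts)
question, answer = to_question(path)
```

Each hop in the chain lives in a different fact, so a single grep on the question text is unlikely to land on the answer.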

knowledge-graph UI: corpus facts extracted into a force-directed graph with a longest-path view

Live graph: every fact an edge; the longest-path view surfaces the multi-hop questions.

Connecting the dots

Both manufacture a search pipeline at test time. Neither grows the model.

A · Embedding algebra

test-time: multi-pass embedding algebra

what scales: structure, not forward passes

B · Agentic search pipeline

test-time: a chain of retrieval tools

what scales: tool composition, not parameters

Thank you

Search is test-time compute
autoresearch scales it

Han Xiao · VP of AI, Elastic

github.com/hanxiao · arXiv:2605.11374
@hxiao · in/hxiao87

follow on X

x.com/hxiao

in/hxiao87

these slides

hanxiao.io/aie-sf-2026

Elastic hackathon this evening

Register on Luma: luma.com/aws-elastic-hacknight
