TL;DR

Text embeddings are widely assumed to be safe, irreversible representations. This work demonstrates otherwise: given only an embedding vector, we reconstruct the original text using conditional masked diffusion.

Paper: arXiv:2602.11047
Demo: https://embedding-inversion-demo.jina.ai
Code: https://github.com/jina-ai/embedding-inversion-demo

If you use this work, please cite the following paper:

Embedding Inversion via Conditional Masked Diffusion Language Models. Han Xiao. arXiv:2602.11047

Bibtex entry:

@article{xiao2026embedding,
  title={Embedding Inversion via Conditional Masked Diffusion Language Models},
  author={Xiao, Han},
  journal={arXiv preprint arXiv:2602.11047},
  year={2026}
}

What is Embedding Inversion?

Embedding inversion is the task of reconstructing the original text given only its embedding vector. Existing inversion methods (Vec2Text, ALGEN, Zero2Text) generate tokens autoregressively and require iterative re-embedding through the target encoder. This creates two bottlenecks: attack cost scales with the number of correction iterations, and left-to-right generation accumulates errors with no mechanism to revise earlier tokens.

We take a different approach: embedding inversion as conditional masked diffusion. Starting from a fully masked sequence, a denoising model reveals tokens at all positions in parallel, conditioned on the target embedding via adaptive layer normalization (AdaLN-Zero). Each denoising step refines all positions simultaneously using global context, without ever re-embedding the current hypothesis.
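
To make the procedure concrete, here is a minimal sketch of such a sampler in PyTorch, using MaskGIT-style confidence-based unmasking with a cosine schedule. The denoiser interface model(tokens, embedding), the schedule, and the mask_id default are illustrative assumptions, not the paper's exact implementation:

import math
import torch

@torch.no_grad()
def invert_embedding(model, embedding, seq_len=32, num_steps=8, mask_id=0):
    # Start from a fully masked sequence.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)

    for step in range(num_steps):
        logits = model(tokens, embedding)       # one forward pass, all positions
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax

        masked = tokens == mask_id
        # Cosine schedule: fraction of positions left masked after this step.
        keep_frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        n_keep = int(keep_frac * seq_len)
        if n_keep == 0:                         # final step: commit everything
            tokens = torch.where(masked, pred, tokens)
            break

        # Reveal the most confident masked positions; the rest stay masked
        # and are re-predicted with more context on the next step.
        conf = conf.masked_fill(~masked, -1.0)
        order = conf.argsort(dim=-1, descending=True)
        n_reveal = max(int(masked.sum()) - n_keep, 0)
        reveal = order[0, :n_reveal]
        tokens[0, reveal] = pred[0, reveal]

    return tokens

Note that the current hypothesis is never re-embedded: the only conditioning signal is the original target vector, which is why the target encoder is not needed at inference time.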

The approach is encoder-agnostic by construction. The embedding vector enters only through AdaLN modulation of layer normalization parameters, so the same architecture applies to any embedding model without alignment training or architecture-specific modifications.
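
For reference, here is a sketch of AdaLN-Zero conditioning inside one transformer block, following the DiT formulation: shift, scale, and gate vectors are regressed from the conditioning input (here, the target embedding), with the projection zero-initialized so every block starts as the identity. Dimensions and the attention wiring are illustrative, not the released architecture:

import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim, n_heads, cond_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress 6 modulation vectors (shift/scale/gate for attn and mlp).
        self.ada = nn.Linear(cond_dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)   # "Zero": blocks start as identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):
        # x: (batch, seq, dim) hidden states; cond: (batch, cond_dim) embedding
        s1, g1, a1, s2, g2, a2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + g1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + a1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + g2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + a2.unsqueeze(1) * self.mlp(h)
        return x

Because the conditioning enters only through these modulation vectors, swapping in a different encoder amounts to changing cond_dim (plus a linear projection if embedding dimensions differ).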

Key Results

On 32-token sequences across three embedding models, the method achieves:

  • 81.3% token accuracy (Qwen3-Embedding-0.6B)
  • 0.87 cosine similarity between the embeddings of the reconstruction and the original text
  • 8 forward passes through a 78M-parameter model
  • No access to the target encoder at inference time

Training details:

  • Model: 8-layer Transformer with AdaLN-Zero conditioning
  • Total params: ~78M trainable
  • Training: 2M multilingual samples from C4/mC4
  • Hardware: four encoder configurations trained in parallel on A100-40GB GPUs (see table below)

Encoder                    Vocab size   Token acc.   Train steps   Batch size   Training data
Qwen3-Embedding-0.6B       152K         81.3%        72.5K         900          2M multilingual
jina-embeddings-v3         250K         77.3%        67K           600          2M English
EmbeddingGemma-300m        262K         78.5%        47.5K         500          2M multilingual
jina-embeddings-v3 (ML)    250K         75.2%        60K           600          2M multilingual
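
For intuition, here is a hedged sketch of what one training step of such a model could look like: sample a mask ratio, corrupt the tokens, and train the denoiser to recover the masked positions conditioned on the frozen encoder's embedding. The denoiser and encoder interfaces, and the mask-ratio bounds, are assumptions for illustration, not the released training code:

import torch
import torch.nn.functional as F

def training_step(denoiser, encoder, token_ids, mask_id, optimizer):
    # token_ids: (batch, seq_len) ground-truth tokens from the corpus.
    with torch.no_grad():
        embedding = encoder(token_ids)          # frozen target encoder

    # Sample a mask ratio; the lower bound just avoids empty masks and is
    # an illustrative choice, not the paper's schedule.
    ratio = torch.empty(()).uniform_(0.15, 1.0)
    mask = torch.rand_like(token_ids, dtype=torch.float) < ratio
    corrupted = token_ids.masked_fill(mask, mask_id)

    logits = denoiser(corrupted, embedding)     # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits[mask], token_ids[mask])  # masked slots only

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()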

Try It Yourself

Check out the live demo to see it in action, or run it locally:

git clone https://github.com/jina-ai/embedding-inversion-demo.git
cd embedding-inversion-demo
pip install -r requirements.txt
python demo_server.py

The demo lets you enter any text, encode it with different embedding models, and reconstruct the original text from the embedding alone.

Privacy Implications

This work raises important questions about embedding privacy. Embeddings are often treated as anonymized representations, but our results show they can leak enough information to reconstruct much of the original text. This has implications for:

  • Federated learning systems that share embeddings
  • Embedding APIs that expose vectors
  • Privacy-preserving IR systems
  • Semantic search applications

The attack requires only black-box access to the embedding model, and only once, to generate training data; it never queries the encoder at inference time, so it remains practical even against rate-limited APIs.
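
For illustration, the black-box collection phase could be as simple as the following sketch, which queries an embedding API once per batch of training texts and stores (text, vector) pairs as JSONL. The endpoint and payload mirror the public Jina embeddings API, but treat the exact field names as assumptions; any embedding endpoint works the same way:

import json
import requests

API_URL = "https://api.jina.ai/v1/embeddings"   # assumed endpoint shape
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

def collect_pairs(texts, out_path, batch_size=100):
    # One API call per batch; the encoder is never queried again
    # after this collection phase.
    with open(out_path, "w") as f:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            resp = requests.post(API_URL, headers=HEADERS, json={
                "model": "jina-embeddings-v3",
                "input": batch,
            })
            for text, item in zip(batch, resp.json()["data"]):
                f.write(json.dumps({"text": text,
                                    "embedding": item["embedding"]}) + "\n")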