Experiment · General AI Theory

Sentence → Vector (384D)

MTEB stands for **Massive Text Embedding Benchmark**, a comprehensive evaluation framework designed to test how well **text embedding models** perform across a wide range of natural language processing tasks.

2025-08-13 · 9 min read · 1,736 words


Best Sentence-to-Vector-to-Sentence (S-V-S) Pairings by Dimension

384D: Efficiency-Focused

Best Pairing: MiniLM-L6-v2 + vec2text-base

```python
# Sentence → Vector (384D)
encoder = "sentence-transformers/all-MiniLM-L6-v2"
```

  • MTEB Avg: 63.05
  • Speed: 14,200 sentences/sec on CPU
  • Size: 80MB

Vector → Sentence

```python
decoder = "jxm/vec2text-base-embeddings-384d"
```

  • ROUGE-L: ~0.42
  • Cosine recovery: ~0.85
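The "cosine recovery" numbers quoted throughout measure how close the decoded sentence's re-embedding lands to the original vector. A minimal sketch of the metric (the function name is ours, not from any library):

```python
import numpy as np

def cosine_recovery(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Cosine similarity between the original embedding and the embedding
    of the decoded-then-re-encoded sentence (both 1-D vectors)."""
    num = float(np.dot(original, reconstructed))
    denom = float(np.linalg.norm(original) * np.linalg.norm(reconstructed))
    return num / denom

# A perfect reconstruction recovers cosine 1.0; an orthogonal one gives 0.0.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(cosine_recovery(a, a))  # 1.0
print(cosine_recovery(a, b))  # 0.0
```

In practice you would compute this over a held-out corpus and average, which is roughly what the ~0.85 figure above summarizes.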

768D: Balanced Performance (Your Current Target)

Best Pairing: GTR-T5-base + vec2text-gtr-base

```python
# Sentence → Vector (768D)
encoder = "sentence-transformers/gtr-t5-base"
```

  • MTEB Avg: 66.13
  • Particularly strong on retrieval tasks
  • Size: 218MB

Vector → Sentence

```python
decoder = "jxm/vec2text-gtr-base"
```

  • ROUGE-L: ~0.68
  • Cosine recovery: ~0.91
  • BERT Score: ~0.89

1024D: High Performance

Best Pairing: E5-large-v2 + Custom vec2text

```python
# Sentence → Vector (1024D)
encoder = "intfloat/e5-large-v2"
```

  • MTEB Avg: 69.78
  • Excellent multilingual support
  • Size: 1.34GB

Vector → Sentence

Note: no official vec2text model exists for 1024D yet. Options:

1. Project to 768D and use vec2text-gtr-base
2. Train a custom vec2text model

```python
projection_matrix = learn_projection(1024, 768)
```
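The `learn_projection` call above is left undefined; one minimal way to realize it is ordinary least squares over paired embeddings of the same sentences from both encoders. The sketch below uses synthetic data in place of real embeddings:

```python
import numpy as np

def learn_projection(X_high: np.ndarray, X_low: np.ndarray) -> np.ndarray:
    """Least-squares map W such that X_high @ W ≈ X_low.
    X_high: (n, 1024) embeddings; X_low: (n, 768) embeddings of the same texts."""
    W, *_ = np.linalg.lstsq(X_high, X_low, rcond=None)
    return W  # shape (1024, 768)

rng = np.random.default_rng(0)
X_high = rng.normal(size=(2000, 1024))   # stand-in for e5-large-v2 embeddings
true_map = rng.normal(size=(1024, 768))
X_low = X_high @ true_map                # stand-in for paired gtr-t5-base embeddings

W = learn_projection(X_high, X_low)
print(W.shape)  # (1024, 768)
```

With real data you would embed a few thousand sentences through both models; the quality ceiling of this route is set by how much of the 1024D geometry survives the linear map.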

1536D: Premium Quality

Best Pairing: OpenAI ada-002 Alternative (Open Source)

Since you want open source, the best alternatives are:

Option 1: GTR-T5-XL

```python
encoder = "sentence-transformers/gtr-t5-xl"
```

  • MTEB Avg: 68.42
  • Size: 1.24GB

Option 2: Cohere Embed v3 (if an API is acceptable)

  • MTEB Avg: 64.47

Vector → Sentence

There is currently no vec2text model for 1536D. Recommendation: train a custom one using your approach.

2048D: Maximum Expressiveness

Best Pairing: Custom ensemble approach

No standard models exist at 2048D, so an ensemble is the practical route:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class Ensemble2048D:
    def __init__(self):
        self.models = [
            SentenceTransformer("sentence-transformers/gtr-t5-xl"),  # 768D
            SentenceTransformer("intfloat/e5-large-v2"),             # 1024D
            SentenceTransformer("BAAI/bge-large-en"),                # 1024D
        ]

    def encode(self, text):
        embeddings = [model.encode(text) for model in self.models]
        return np.concatenate(embeddings)[:2048]
```
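A caveat worth stating: 768 + 1024 + 1024 = 2816 dims, so the `[:2048]` slice silently drops the last 768 dims of the final model. The shape logic can be checked with stub encoders standing in for the real models (everything here is a placeholder, not a real SentenceTransformer):

```python
import numpy as np

class StubEncoder:
    """Stand-in for a sentence encoder; maps text to a fixed-dim vector."""
    def __init__(self, dim: int, seed: int):
        self.rng = np.random.default_rng(seed)
        self.dim = dim

    def encode(self, text: str) -> np.ndarray:
        # A real model would embed `text`; here we just emit a vector of the right size.
        return self.rng.normal(size=self.dim)

models = [StubEncoder(768, 1), StubEncoder(1024, 2), StubEncoder(1024, 3)]
embeddings = [m.encode("hello world") for m in models]
combined = np.concatenate(embeddings)  # 768 + 1024 + 1024 = 2816 dims
print(combined.shape)        # (2816,)
print(combined[:2048].shape) # (2048,)
```

If the discarded tail matters, a learned reduction (e.g., PCA from 2816 to 2048) would preserve more information than plain truncation.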

Comprehensive Benchmark Comparison

| Dimension | Encoder Model | MTEB Avg | Decoder Model | ROUGE-L | Cosine Recovery | BERT Score |
|---|---|---|---|---|---|---|
| 384D | all-MiniLM-L6-v2 | 63.05 | vec2text-base-384d | 0.42 | 0.85 | 0.81 |
| 768D | gtr-t5-base | 66.13 | vec2text-gtr-base | 0.68 | 0.91 | 0.89 |
| 768D | gte-base | 63.13 | vec2text-gte-base | ~0.60 | ~0.88 | ~0.86 |
| 1024D | e5-large-v2 | 69.78 | [Projected to 768D] | ~0.65 | ~0.87 | ~0.85 |
| 1536D | gtr-t5-xl | 68.42 | [Custom needed] | n/a | n/a | n/a |

Values marked ~ are estimated based on model architecture.

Key Findings for Your VMM Architecture

  • 768D Sweet Spot: GTR-T5-base offers the best balance of quality and decoder availability
  • Cosine Alignment: GTR models consistently achieve >0.90 cosine recovery
  • MTEB Leaders: E5 models dominate MTEB benchmarks but lack decoders

Implementation for Multi-Dimensional Support

```python
class MultiDimS2V2S:
    """Support multiple dimensions with fallback."""

    def __init__(self):
        self.encoders = {
            384: "all-MiniLM-L6-v2",
            768: "gtr-t5-base",
            1024: "e5-large-v2",
            1536: "gtr-t5-xl",
        }
        self.decoders = {
            384: "vec2text-base-384d",
            768: "vec2text-gtr-base",
            # Fallback for higher dims
            1024: self.project_and_decode,
            1536: self.project_and_decode,
        }
        self.projections = {}  # e.g. "1024_to_768" -> learned projection matrix

    def project_and_decode(self, embedding, target_dim=768):
        """Project to 768D for decoding."""
        projection = self.projections[f"{len(embedding)}_to_{target_dim}"]
        projected = projection @ embedding
        # decode_with is a stand-in for your vec2text inference wrapper
        return decode_with(self.decoders[target_dim], projected)
```
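The fallback dispatch can be exercised end-to-end with stubs; everything below (`decode_768`, the projection entries) is illustrative stand-in code, not a real vec2text call:

```python
import numpy as np

def decode_768(embedding):
    """Stub for a real 768D decoder (e.g., vec2text-gtr-base)."""
    return f"<decoded from {len(embedding)}D>"

# One learned projection per source dimension; random here for illustration.
projections = {"1024_to_768": np.random.default_rng(0).normal(size=(768, 1024))}

def project_and_decode(embedding, target_dim=768):
    projection = projections[f"{len(embedding)}_to_{target_dim}"]
    return decode_768(projection @ embedding)

decoders = {768: decode_768, 1024: project_and_decode, 1536: project_and_decode}

vec = np.zeros(1024)
print(decoders[len(vec)](vec))  # <decoded from 768D>
```

The dict-of-callables pattern keeps the call site uniform: `decoders[dim](vector)` works whether the dimension has a native decoder or needs projection first.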

Recommendation for Your Project

Given your focus on 768D and the need for high-quality reconstruction:

  • Primary: Use GTR-T5-base + vec2text-gtr-base (the best proven combination)
  • Future-proof: Store embeddings at multiple dimensions, as you planned
  • Consider: Training a custom vec2text model for 1536D if you find 768D limiting

The GTR-T5 family provides the best balance of:

  • High MTEB scores (66.13)
  • Excellent reconstruction (ROUGE-L: 0.68)
  • A proven vec2text decoder
  • Reasonable size for your M4 Mac
🧪 What MTEB Measures

MTEB evaluates models on 40+ tasks (in English v2) across categories like:

  • 🔍 Retrieval (e.g., document or passage search)
  • 🧠 Semantic Textual Similarity (STS)
  • 🗂️ Classification (topic, intent, etc.)
  • 🧩 Clustering (grouping similar texts)
  • 📄 Reranking (ranking search results)
  • 🧾 Summarization & QA (in some variants)

Each model gets a composite score (MTEB avg) based on its performance across these tasks. This score helps compare generalist embedders like NV-Embed-v2 or STELLA across use cases.
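As a rough illustration of the composite score (the per-task numbers below are invented, and real MTEB averages over many individual datasets rather than five category scores):

```python
# Hypothetical per-category scores for one model (not real MTEB results).
task_scores = {
    "retrieval": 0.55,
    "sts": 0.82,
    "classification": 0.74,
    "clustering": 0.46,
    "reranking": 0.58,
}

# The composite is (roughly) the unweighted mean across tasks.
mteb_avg = sum(task_scores.values()) / len(task_scores)
print(round(mteb_avg, 3))  # 0.63
```

Because it is an average, a model can top the composite while being mediocre on the one category you care about, which is why the leaderboard also exposes per-task breakdowns.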

🏆 Why It Matters

  • Standardized: Offers a unified benchmark for comparing embedding models.
  • Modular: Supports different domains (e.g., medical, legal, multilingual).
  • Practical: Helps developers choose the right model for retrieval, clustering, or semantic search.

You can explore the official MTEB leaderboard to see how models stack up.

| # | Model | Dir. | Dims | RAM class (fp16 est.) | Perf / signal | Arch / notes | MPS (Mac GPU) | License |
|---|---|---|---|---|---|---|---|---|
| 1 | STELLA EN 400M v5 | sent→vec | 512–8192 (MRL); 1024 rec. | ~1 GB | 1024d ≈ 8192d (Δ≈0.001 MTEB, model card) | GTE-derived; Matryoshka dim slicing | Yes (PyTorch mps) † | MIT |
| 2 | all-mpnet-base-v2 | sent→vec | 768 | ~0.8 GB | Strong SBERT baseline | MPNet encoder (Sentence-Transformers) | Yes (PyTorch mps) † | Apache-2.0 |
| 3 | BGE-large-en-v1.5 | sent→vec | 1024 | ~1.3–1.6 GB | SOTA-class retrieval | BERT-family; FlagEmbedding | Yes (PyTorch mps) † | MIT |
| 4 | E5-large-v2 (intfloat) | sent→vec | 1024 | ~1.4–1.7 GB | Strong retrieval/STS | 24-layer transformer | Yes (PyTorch mps) † | Apache-2.0 |
| 5 | GTE-large-en-v1.5 | sent→vec | 1024 | ~1–1.7 GB | SOTA in size tier | "Transformer++" (BERT + RoPE + GLU), ctx 8192 | Yes (PyTorch mps) † | MIT |
| 6 | GTR-T5-base | sent→vec | 768 | ~0.9 GB | Classic T5 retrieval | T5-based Sentence-Transformers model | Yes (PyTorch mps) † | Apache-2.0 |
| 7 | all-MiniLM-L6-v2 | sent→vec | 384 | ~0.2 GB | Great speed/quality | 6-layer MiniLM distilled encoder | Yes (PyTorch mps) † | Apache-2.0 |
| 8 | Jina-embeddings-v3 | sent→vec | 1024 (MRL down to 32) | ~1.8–2.2 GB | Multilingual; long ctx | XLM-R backbone + RoPE; MRL truncation | Yes (PyTorch mps) † | Apache-2.0 |
| 9 | nomic-embed-text-v1.5 | sent→vec | 768 (MRL 768→64) | ~0.8–1.0 GB | Good price/quality | MRL; multiple output dims supported | Yes (PyTorch mps) † | Apache-2.0 |
| 10 | Snowflake Arctic-embed-L v2.0 | sent→vec | 1024 | ~1.3–1.5 GB | Enterprise retrieval | E5-Large-style retriever; multilingual in v2.0 | Yes (PyTorch mps) † | Apache-2.0 |
| 11 | multilingual-E5-large | sent→vec | 1024 | ~1.4–1.7 GB | Strong multilingual | 24-layer transformer; InfoNCE | Yes (PyTorch mps) † | Apache-2.0 |
| 12 | BGE-M3 | sent→vec | 1024 | ~2.2 GB | Hybrid dense/sparse/multi-vector | 569M params; 8k ctx; multilingual | Yes (PyTorch mps) † | MIT |
| 13 | Voyage-3 / 3.5 | sent→vec | 2048 / 1024 / 512 / 256 | API | SOTA-level API | Adjustable dims via API | N/A (hosted API) | Proprietary |
| 14 | Cohere Embed v3 (EN/ML) | sent→vec | 1024 or 384 | API | Strong MTEB; fast | v3 sizes and dims per docs | N/A (hosted API) | Proprietary |
| 15 | Vec2Text | vec→sent | Any (per target embedder) | Dep. on decoder | EMNLP'23: high-fidelity recon (e.g., exact on many 32-tok inputs) | Controlled generation to match a fixed embedding; open lib | Yes (PyTorch mps) † | MIT |
| 16 | ZSInvert (universal zero-shot embedding inversion) | vec→sent | Any | Research code | Zero-shot, fast & query-efficient | Adversarial decoding; train-once corrector | Likely (PyTorch) † | Research (arXiv) |
| 17 | ALGEN (few-shot inversion) | vec→sent | Any | Research code | Few-shot, cross-model/domain | Align victim→attack space + generative decoding | Likely (PyTorch) † | Research (ACL'25) |
| 18 | InvBERT | vec→sent | Token-level ctx reps | Research code | Feasible recon from contextual embeddings | Two variants (seq2seq / classify) | Likely (PyTorch) † | Research |
| 19 | Generative Embedding Inversion (Li et al., 2023) | vec→sent | Any | Research code | Generative attack improves recovery | Early generative inversion for sentence embeds | Likely (PyTorch) † | Research |
| 20 | RetroMAE (reconstruction pretrain) | vec→sent (pretrain task) | N/A | N/A | Encoder recreates text from its sentence embedding + masked input | MAE-style pretraining with embed→decoder reconstruction | Yes (PyTorch mps) † | Apache-2.0 |

† MPS notes (Apple GPU on Mac): PyTorch's MPS backend accelerates Transformers on Apple silicon using Metal. Most HF/SBERT models "just work" on `mps`, but a few ops may fall back to CPU (set `PYTORCH_ENABLE_MPS_FALLBACK=1`). Conversion to Core ML is also possible (Optimum/Exporters) if you want ANE/GPU deployment.

Extra notes you'll care about

  • New vec→text additions you asked for: ZSInvert, ALGEN, InvBERT, the Li et al. (2023) generative inversion, and RetroMAE (embed-to-text reconstruction during pretraining). These complement Vec2Text and give you zero-shot and few-shot options across black-box embedders.
  • If you plan to invert embeddings, treat vectors as sensitive: multiple papers show high-fidelity reconstructions under realistic assumptions.

    High-confidence pairs (evaluated in papers)

  • GTR-T5-base (768d) → Vec2Text: EMNLP'23 shows very strong reconstructions (e.g., ~92% exact match on 32-token snippets with sequence-beam search); GTR-T5 models output 768-d vectors. (ACL Anthology; Hugging Face)
  • OpenAI text-embedding-ada-002 (1536d) → Vec2Text: the same paper reports solid exact-match rates on ada-002; the legacy ada-002 uses 1536 dims. (ACL Anthology; OpenAI Community)
  • Contriever (768d) → ZSInvert (zero-shot): ZSInvert evaluates on Contriever and recovers semantically faithful text (F1 > 50, cosine > 0.90) without per-encoder training; Contriever exports 768-d vectors. (ar5iv; Hugging Face)
  • GTR-T5 (768d) → ZSInvert (zero-shot): the same ZSInvert study includes GTR; zero-shot, with high cosine similarity. (ar5iv)
  • GTE-large-en-v1.5 (1024d) → ZSInvert (zero-shot): GTE is one of the evaluated encoders; large-en-v1.5 outputs 1024-d vectors. (ar5iv; Hugging Face)
  • GTE-Qwen2-1.5B-instruct (1536d) → ZSInvert (zero-shot): explicitly listed among ZSInvert's encoders; the model card notes 1536-d embeddings. (ar5iv; Hugging Face)
  • Sentence-T5 (768d) → ALGEN (few-shot): ALGEN trains a local decoder (FLAN-T5 decoder) and aligns it to victim embedders; with ~1k alignment samples it gets strong ROUGE/cosine on T5 embeddings. Sentence-T5 uses 768-d vectors. (ACL Anthology; Hugging Face)
  • GTR-T5 (768d) → ALGEN (few-shot): reported ROUGE-L ≈ 38 and cosine ≈ 0.89 with 1k leaked samples. (ACL Anthology)
  • mT5 embeddings (≈768d) → ALGEN (few-shot): ROUGE-L ≈ 43, cosine ≈ 0.94 at 1k samples. (ACL Anthology)
  • mBERT (768d) → ALGEN (few-shot): ROUGE-L ≈ 40, cosine ≈ 0.92 at 1k samples. (ACL Anthology)
  • OpenAI text-embedding-ada-002 (1536d) → ALGEN (few-shot): ROUGE-L ≈ 41 and cosine ≈ 0.93 at 1k samples. (ACL Anthology)
  • OpenAI text-embedding-3-large (3072d) → ALGEN (few-shot): ROUGE-L ≈ 41 and cosine ≈ 0.91 at 1k samples; 3-large uses 3072 dims. (ACL Anthology; OpenAI Platform)
  • Sentence-BERT (SBERT, e.g., 768d) → GEIA (generative inversion): GEIA reconstructs ordered sequences across the SBERT family (the paper evaluates SBERT/SimCSE/ST5/MPNet). (ACL Anthology)
  • SimCSE-RoBERTa (768d) → GEIA: the same GEIA paper shows good lexical overlap (ROUGE-1 ≈ 0.59–0.72; BLEU-1 ≈ 0.35–0.46 across victims). SimCSE-RoBERTa exports 768-d embeddings. (ACL Anthology; Hugging Face)
  • all-MPNet-base-v2 (768d) → GEIA: MPNet is one of GEIA's evaluated victims; all-mpnet-base-v2 is 768-d. (ACL Anthology; Hugging Face)
  • Sentence-T5 (768d) → GEIA: also explicitly evaluated as a victim model in GEIA. (ACL Anthology)

Solid "works in practice" pairs (generalizable decoders, widely used encoders)

  • E5-large-v2 (1024d) → ZSInvert (zero-shot): ZSInvert is universal and tested across BERT/T5-style encoders; E5-large-v2 is a popular 1024-d BERT-style encoder, making it a good match in the same family. (ar5iv; Hugging Face)
  • Contriever (768d) → ALGEN (few-shot): ALGEN's method is encoder-agnostic once you align a small leaked set; encoders like Contriever are typical BERT-style and align well under ALGEN's linear map. (ACL Anthology; Hugging Face)
  • STELLA-en-400M-v5 (choose 768d/1024d) → ZSInvert (zero-shot): ZSInvert is embedding-agnostic; STELLA 400M v5 offers multiple output dims (e.g., 768/1024), so you can pick a dimension that matches your store. (ar5iv; Hugging Face)
  • all-MiniLM-L6-v2 (384d) → GEIA: GEIA generalizes across SBERT/SimCSE-style encoders; all-MiniLM-L6-v2 (384-d) is a compact SBERT model often used in the wild. (ACL Anthology; Hugging Face)

Quick pairing tips

  • Match dimensions and normalization: most encoders L2-normalize outputs; decoders like Vec2Text/ZSInvert/ALGEN/GEIA expect the same normalization you used when storing vectors.
  • Pooling matters: for encoders that need it (E5/GTE/Contriever), use the paper's recommended pooling (usually mean-pool over the last hidden state + L2-norm) so inversion matches what produced the stored vectors. (Hugging Face)
  • Pick the right decoder style:
    - Vec2Text → best when you can afford per-encoder training; it yields _exact_ matches on some encoders/datasets. (ACL Anthology)
    - ZSInvert → best when you want _zero-shot_ support across many encoders with high semantic fidelity. (ar5iv)
    - ALGEN → best when you can get ~1–1000 leaked (text, embedding) pairs for the victim and want strong results fast via linear alignment + a single local decoder. (ACL Anthology)
    - GEIA → a good general generative baseline across classic SBERT/SimCSE/T5/MPNet families. (ACL Anthology)
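The pooling and normalization advice can be made concrete. This sketch mean-pools a (seq_len, dim) matrix of token states over a padding mask and L2-normalizes, using random data in place of real model outputs:

```python
import numpy as np

def mean_pool_l2(token_embs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then L2-normalize.
    token_embs: (seq_len, dim); mask: (seq_len,) with 1 for real tokens, 0 for padding."""
    pooled = (token_embs * mask[:, None]).sum(axis=0) / mask.sum()
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 768))   # stand-in for a model's last hidden state
mask = np.array([1, 1, 1, 1, 0, 0])  # last two positions are padding
vec = mean_pool_l2(tokens, mask)
print(vec.shape, round(float(np.linalg.norm(vec)), 6))  # (768,) 1.0
```

If stored vectors were produced with CLS pooling or without normalization, the inverter sees a systematically shifted distribution, which is exactly the mismatch these tips warn against.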
