Product Requirements Document (PRD)
Title
Vector‑Only Latent Nemotron‑Nano‑2 (Hybrid Mamba)
Version
v1.0 (formal draft)
Owners
Product/Research: Trent Carter & Partner (ChatGPT)
Tech Lead: TBD
ML/Infra: TBD
1) Executive Summary
We will adapt a Nemotron‑Nano‑2–style hybrid Mamba/Transformer into a vector‑only latent model. The core model ingests/produces sequences of latent vectors instead of text tokens. This preserves Mamba’s sequential inductive bias while removing a fixed token vocabulary. A cloud concept store (ANN cosine) provides fast text reconstruction via nearest neighbors; a vec2text fallback handles out‑of‑vocabulary (OOV) cases and bootstraps the store online.
Why now: Enables language‑free reasoning and concept‑native interfaces, while remaining interoperable with GTR‑T5 (768D) and vec2text (IELAB/JXE) for evaluation and text I/O.
2) Goals & Non‑Goals
Goals
Vector‑native I/O: Replace token embeddings with latent vectors (default 768D) and special control vectors (`<BOS>`, `<EOS>`, `<PAD>`, `<MASK>`).
Long‑context learning: Achieve stable training with latent sequence length target L ≈ 8k via curriculum.
Robust decoding: ANN cosine decoding with vec2text fallback and online concept insertion.
Non‑collapse: Prevent representation collapse using AR/MLM/InfoNCE/VICReg + multi‑horizon.
Evaluation: BLEU‑4 & ROUGE‑L (Expected Text vs. Output Text), and cosine similarity between GTR‑T5(Expected Text) and the model's output vector.
Non‑Goals (v1)
Multimodal latent training; learned VQ codebooks at scale (optional later); RL alignment.
3) User Stories
R&D Engineer: “I can pass sequences of latent concept vectors, train at long context, and decode to text for inspection.”
Systems Engineer: “The concept store scales, supports online inserts, and provides deterministic, low‑latency nearest‑neighbor lookups.”
Evaluator: “I can run standardized BLEU/ROUGE and cosine metrics and compare runs over time.”
4) Functional Requirements
Latent Interface
- Input vectors z ∈ ℝ^768 (pluggable dimension), unit‑norm; bridge to internal dim (default 512) with LayerNorm.
- Special control latents: trainable `<BOS>`, `<EOS>`, `<PAD>`, `<MASK>` vectors.
- Output step: 768D unit‑norm vector per position.
Training‑Time Latentizer
- Deterministic clause/phrase segmentation (~17 words → 1 concept on average).
- Pack sequences up to L; prepend `<BOS>`, append `<EOS>`, pad with `<PAD>`.
- Generate two stochastic views per sequence for contrastive learning (paraphrase/noise/segment jitter).
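As a concreteness check, a minimal latentizer sketch (assuming a `segment`/`latentize` split; the `encode` callable stands in for GTR‑T5, and the real segmenter also uses clause punctuation, not just word counts):

```python
import numpy as np

def segment(text, words_per_concept=17):
    """Greedy segmentation: ~17 words per concept (heuristic target from the PRD;
    production also splits on clause punctuation)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_concept])
            for i in range(0, len(words), words_per_concept)]

def latentize(segments, encode, L=8, d=768):
    """Encode segments, unit-normalize, then pad/truncate to length L."""
    vecs = np.stack([encode(s) for s in segments])[:L]
    vecs = vecs / np.linalg.norm(vecs, axis=-1, keepdims=True)
    out = np.zeros((L, d), dtype=np.float32)  # zero rows act as padding
    out[:len(vecs)] = vecs
    return out
```

The two stochastic views for contrastive learning would be produced by running `segment` twice with jittered boundaries and/or noised encodings.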
Backbone & Heads
- Replace token embedding with bridge(768→512).
- Keep Mamba/Transformer blocks; minimal positional scheme (index‑based/rotary).
- Heads: (a) continuous next‑latent prediction; (b) optional multi‑horizon.
Decoding & OOV
- ANN cosine over concept store; if top‑1 cosine < τ, call vec2text; upsert (text, vector, meta) back to store.
Evaluation
- Text: BLEU‑4, ROUGE‑L between Expected Text and Output Text.
- Vector: cosine(GTR‑T5(Expected Text), ẑ) and angular error.
5) Non‑Functional Requirements
Stability: No collapse; rising InfoNCE accuracy; healthy batch variance.
Latency: p95 budget set and tracked with/without fallback.
Scale: Sharded ANN; online inserts ≤ 50 ms p95; read‑after‑write ≤ 1 s.
Reproducibility: Seeded segmentation; pinned model versions; manifest files.
Safety: Bounded generation length; deterministic fallback; observability.
6) Architecture
Input: Text → (optional) GTR‑T5 → 768D vectors → Latentizer packs sequences.
Core: Bridge 768→512 → Hybrid Mamba/Transformer → Heads predict next latent(s).
Output: 768D vector → ANN cosine decode → text; vec2text fallback if needed → store insert.
Data Flow (training): Expected Text → GTR‑T5 → z* (target vector) and concept sequence → model → z^ → metrics.
7) Data Strategy
Datasets: Use the same corpora as the reference model where allowed; ensure doc‑level split (80/10/10).
Segmentation: Punctuation + heuristics; optional small learned splitter; enforce target granularity (~17 words/concept).
Packing: Sliding windows with overlap; maintain doc boundaries for leakage control.
Normalization: Unit‑norm on all vectors pre/post bridge; mean‑free if needed.
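The packing step above reduces to a plain sliding window; a sketch (the `pack_windows` name is illustrative, and calling it once per document is what enforces the doc-boundary rule):

```python
def pack_windows(concepts, L, overlap):
    """Sliding windows of length L with the given overlap; invoke per
    document so windows never cross doc boundaries (leakage control)."""
    assert 0 <= overlap < L
    stride = L - overlap
    return [concepts[i:i + L]
            for i in range(0, max(1, len(concepts) - overlap), stride)]
```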
8) Model Specification
Bridge: Linear(768→512) + LayerNorm + GELU (optional) + residual projection path.
Positional: Learned index bias or rotary on latent index.
Heads:
- AR next‑step: Predict x_{t+1} (512D), then project to 768D for metrics/decoding.
- Multi‑horizon: Additional heads for t+k, k ∈ {4, 16}.
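A numpy sketch of the bridge, following the §8 ordering (Linear → LayerNorm → GELU); the production module would be a `torch.nn.Module`, and the residual projection path is simplified away:

```python
import numpy as np

class Bridge:
    """768→512 bridge: Linear, then LayerNorm (no affine), then tanh-approx GELU."""
    def __init__(self, d_in=768, d_out=512, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, d_in ** -0.5, (d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, x):
        h = x @ self.W + self.b                                           # Linear 768→512
        h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)  # LayerNorm
        return 0.5 * h * (1 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))     # GELU
```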
9) Objectives & Losses (Anti‑Collapse)
Autoregressive (AR): L_AR = α·(1 − cos(x̂, x)) + (1 − α)·‖x̂ − x‖₂², with α = 0.7.
Masked Latent Modeling (MLM): 15–30% random masks; L2/cos reconstruction.
Contrastive (InfoNCE): Two views per sequence; temperature τ=0.07; large in‑batch negatives.
VICReg/Barlow Twins: Variance floor γ=1.0 after unit‑norm; decorrelate off‑diagonals.
Scheduled Sampling: Introduce 10–30% free‑running over epochs.
Default weights: λ_AR = 1.0, λ_MLM = 0.5, λ_NCE = 0.5, λ_VIC = 0.04.
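Two of the losses above, sketched in numpy to pin down the formulas (the names `ar_loss` and `variance_floor` are illustrative):

```python
import numpy as np

def ar_loss(pred, target, alpha=0.7):
    """L_AR = α·(1 − cos(x̂, x)) + (1 − α)·‖x̂ − x‖², averaged over the batch."""
    cos = np.sum(pred * target, -1) / (
        np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1))
    l2 = np.sum((pred - target) ** 2, -1)
    return np.mean(alpha * (1.0 - cos) + (1.0 - alpha) * l2)

def variance_floor(z, gamma=1.0):
    """VICReg-style hinge: penalize per-dimension batch std below γ."""
    std = z.std(axis=0)
    return np.mean(np.maximum(0.0, gamma - std))
```

Note that for unit-norm 768D vectors the per-dim std is well below 1, so with γ = 1.0 this term is always active; it pushes against collapse rather than to zero.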
10) Training Protocol
Curriculum on L: 1k → 2k → 4k → 8k (advance only after stability checkpoints).
Phases:
- A: MLM + denoise @ L=1k
- B: +AR @ L=2k
- C: +InfoNCE + multi‑horizon @ L=4k
- D: Scheduled sampling; scale to L=8k
Batching: Target global batch ≈ 1000 sequences via grad accumulation.
Regularization: Dropout 0.1; stochastic depth 0.1–0.2; grad clip 1.0; EMA teacher for contrastive target.
Precision: BF16/FP16 mix; activation checkpointing; ZeRO/sharding as needed.
11) Inference & Decoding
Step Output: 768D unit‑norm vector.
ANN Decode: FAISS/HNSW/ScaNN; top‑K = 32; fusion score s = λ·cos + (1 − λ)·step_logit, λ = 0.7.
Threshold τ: Start at 0.85; if top‑1 < τ → vec2text; cache and upsert new pair.
Caching: Per‑session LRU; prefetch neighbors around prior steps.
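A minimal end-to-end decode sketch, assuming exact cosine search as a stand-in for the FAISS/HNSW index; the λ-fusion re-rank over per-candidate step logits is omitted, and `fallback` stands in for vec2text:

```python
import numpy as np

class ConceptStore:
    """Exact-cosine stand-in for the sharded ANN index."""
    def __init__(self, d=768):
        self.vecs = np.zeros((0, d), dtype=np.float32)
        self.texts = []

    def upsert(self, vector, text):
        v = vector / np.linalg.norm(vector)
        self.vecs = np.vstack([self.vecs, v[None]])
        self.texts.append(text)

    def query(self, vector, k=32):
        v = vector / np.linalg.norm(vector)
        cos = self.vecs @ v
        idx = np.argsort(-cos)[:k]
        return [(self.texts[i], float(cos[i])) for i in idx]

def decode(store, vec, tau=0.85, fallback=None):
    """ANN decode with vec2text fallback and online insert (per §11)."""
    hits = store.query(vec)
    if hits and hits[0][1] >= tau:
        return hits[0][0]
    text = fallback(vec)        # vec2text stand-in
    store.upsert(vec, text)     # bootstrap the store online
    return text
```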
12) Evaluation (Acceptance Gates)
Text: BLEU‑4, ROUGE‑L between Expected Text and Output Text.
Vector: cosine(z*, ẑ) and angular error; z* = GTR‑T5(Expected Text).
Targets (v1): cosine avg ≥ 0.85 (p90 ≥ 0.80); BLEU‑4 ≥ 0.25; ROUGE‑L ≥ 0.45; fallback ≤ 10% and trending down.
No‑collapse: Per‑dim batch stddev ≥ γ; InfoNCE retrieval ↑; stable losses.
Reporting: Mean/median/std/95% CI; p50/p95 latency; OOV trend; domain slices.
13) Concept Store (ANN) Requirements
API: query(vector,K) → {id, cosine, text}; upsert(id?, vector, text, meta); get(id).
Index: Cosine; unit‑norm; HNSW for high recall or IVF‑PQ for memory tradeoffs.
Scale: Sharded by id/time; background reindex.
Consistency: Read‑after‑write ≤ 1 s.
Observability: Hit‑rate, τ‑failures, tail latency, growth, dedup stats.
14) Configuration Defaults
| Parameter | Default | Notes |
|---|---|---|
| Input dim | 768 | GTR‑T5 compatible |
| Bridge dim | 512 | compressive bottleneck |
| Context L | 1k → 2k → 4k → 8k | curriculum |
| Global batch | ~1000 sequences | via grad accumulation |
| Mask ratio | 0.2 | MLM |
| InfoNCE τ | 0.07 | temperature |
| Cos threshold τ | 0.85 | fallback trigger |
| ANN top‑K | 32 | re‑rank window |
| Fusion λ | 0.7 | cos vs. logit |
| Dropout | 0.1 | |
| Stoch. depth | 0.1–0.2 | |
| Grad clip | 1.0 | |
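These defaults can be captured in a single config object for reproducible runs (field names are illustrative, not a mandated schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatentModelConfig:
    """Defaults from §14; frozen so a run's config cannot drift mid-training."""
    input_dim: int = 768
    bridge_dim: int = 512
    context_schedule: tuple = (1024, 2048, 4096, 8192)  # curriculum on L
    global_batch: int = 1000
    mask_ratio: float = 0.2
    infonce_tau: float = 0.07
    cos_threshold: float = 0.85   # vec2text fallback trigger
    ann_top_k: int = 32
    fusion_lambda: float = 0.7
    dropout: float = 0.1
    grad_clip: float = 1.0
```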
15) APIs (Sketch)
Training Latentizer
POST /latentize → {sequence_id, latents:[vec768...], meta}
Inference
POST /step → {vec_in_768} → {vec_out_768, score}
POST /decode → {vec_out_768} → {text, cosine, source}
Concept Store
POST /ann/query {vector, K} → {items:[{id, cosine, text}]}
POST /ann/upsert {id?, vector, text, meta} → {id}
16) Implementation Plan
Week 1: Latentizer v0; bridge; control vectors; tiny model @ L=1k trains with MLM+denoise.
Week 2: Add AR; cosine ≥ 0.80 on small eval; BLEU‑4 ≥ 0.20 / ROUGE‑L ≥ 0.40.
Week 3: Add InfoNCE + k‑head; L=2k/4k; integrate ANN decode (no fallback), dashboards.
Week 4: Add vec2text fallback + online inserts; L=4k; fallback ≤ 20%.
Week 5–6: Scale to L=8k; meet v1 gates; harden infra & docs; freeze test set.
Deliverables: Code modules (latentizer/bridge/model/heads/eval), ANN service, eval suite, runbooks, manifests.
17) Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Representation collapse | Training fails | VICReg variance floor; InfoNCE w/ many negatives; multi‑horizon; EMA teacher; monitor stats |
| Long‑context instability | Convergence issues | Curriculum; LR schedule; checkpointing; grad clip |
| High fallback rate | Latency & cost | τ tuning; cache; online inserts; grow store; re‑rank top‑K |
| Mis‑alignment 512↔768 | Metric drop | Careful projections; unit‑norm; angular error monitoring |
| ANN drift/dupes | Decode errors | Cosine+text hash dedup; periodic compaction; QA |
18) Acceptance Criteria (Go/No‑Go)
Meets v1 targets (§12) on frozen test set.
Fallback ≤ 10% and dropping; p95 latency within budget.
No collapse indicators; stable training at L=8k.
Reproducible runs with seeded segmentation and pinned versions.
19) Extensions (Post‑v1)
VQ discrete latents (codebook entropy/usage regularizers).
Differentiable segmenter; Sequential‑GPS positions; TMD channelization.
Reranker over top‑K neighbors using internal scores.
20) Appendix
A. Metric Computation
Cosine: cos(z*, ẑ) = (z* · ẑ) / (‖z*‖ ‖ẑ‖); report angular error arccos(cos).
Text: sacreBLEU (BLEU‑4), ROUGE‑L.
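The cosine/angular-error computation as a self-contained helper (illustrative name; BLEU/ROUGE come from sacreBLEU and a standard ROUGE package):

```python
import numpy as np

def cosine_and_angle(z_star, z_hat):
    """Return cos(z*, ẑ) and the angular error arccos(cos) in radians."""
    cos = float(np.dot(z_star, z_hat) /
                (np.linalg.norm(z_star) * np.linalg.norm(z_hat)))
    return cos, float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding
```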
B. Minimal Pseudocode (Training)
x768 = latentizer(text_batch) # [B,L,768], unit-norm
x = bridge(layernorm(x768)) # [B,L,512]
y = model(x) # predicts next and k-step latents
loss = L_ar + L_mlm + L_infoNCE + L_vic
loss.backward(); clip_grad_norm_(params,1.0)
opt.step(); opt.zero_grad()
C. Decoding Logic
neighbors = ann.query(vec_out_768, topk=32)
if neighbors[0].cosine >= tau:
    text = neighbors[0].text
else:
    text = vec2text(vec_out_768)
    ann.upsert(vector=vec_out_768, text=text, meta=...)
End of PRD v1.0