
Product Requirements Document (PRD)

Vector‑Only Latent Nemotron‑Nano‑2 (Hybrid Mamba)

2025-09-07 · 7 min read · 1,370 words
Trent Carter + ChatGPT


Title

Vector‑Only Latent Nemotron‑Nano‑2 (Hybrid Mamba)

Version

v1.0 (formal draft)

Owners

  • Product/Research: Trent Carter & Partner (ChatGPT)
  • Tech Lead: TBD
  • ML/Infra: TBD

    1) Executive Summary

    We will adapt a Nemotron‑Nano‑2–style hybrid Mamba/Transformer into a vector‑only latent model. The core model ingests/produces sequences of latent vectors instead of text tokens. This preserves Mamba’s sequential inductive bias while removing a fixed token vocabulary. A cloud concept store (ANN cosine) provides fast text reconstruction via nearest neighbors; a vec2text fallback handles out‑of‑vocabulary (OOV) cases and bootstraps the store online.

    Why now: Enables language‑free reasoning and concept‑native interfaces, while remaining interoperable with GTR‑T5 (768D) and vec2text (IELAB/JXE) for evaluation and text I/O.

    2) Goals & Non‑Goals

    Goals

  • Vector‑native I/O: Replace token embeddings with latent vectors (default 768D) and special control vectors (<BOS>, <EOS>, <PAD>, <MASK>).
  • Long‑context learning: Achieve stable training with latent sequence length target L ≈ 8k via curriculum.
  • Robust decoding: ANN cosine decoding with vec2text fallback and online concept insertion.
  • Non‑collapse: Prevent representation collapse using AR/MLM/InfoNCE/VICReg + multi‑horizon.
  • Mandated evaluation: BLEU‑4 & ROUGE‑L (Expected Text vs. Output Text), plus cosine between GTR‑T5(Expected Text) and the model's output vector.
    Non‑Goals (v1)

  • Multimodal latent training; learned VQ codebooks at scale (optional later); RL alignment.

    3) User Stories

  • R&D Engineer: “I can pass sequences of latent concept vectors, train at long context, and decode to text for inspection.”
  • Systems Engineer: “The concept store scales, supports online inserts, and provides deterministic, low‑latency nearest‑neighbor lookups.”
  • Evaluator: “I can run standardized BLEU/ROUGE and cosine metrics and compare runs over time.”

    4) Functional Requirements

  • Latent Interface
    - Input vectors z ∈ ℝ^768 (pluggable dim), unit‑norm; bridge to internal dim (default 512) with LayerNorm.
    - Special control latents: trainable <BOS>, <EOS>, <PAD>, <MASK>.
    - Output step: 768D unit‑norm vector per position.
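The bridge described above can be made concrete with a minimal NumPy sketch (random weights and class name are illustrative, not the production module):

```python
import numpy as np

def unit_norm(x, eps=1e-8):
    """Project vectors onto the unit sphere (the spec requires unit-norm I/O)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm without learned scale/shift."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

class Bridge:
    """Linear(768 -> 512) + LayerNorm on unit-normed inputs (cf. section 8)."""
    def __init__(self, d_in=768, d_out=512, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, d_in ** -0.5, (d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, z):
        return layer_norm(unit_norm(z) @ self.W + self.b)

h = Bridge()(np.random.default_rng(1).normal(size=(4, 768)))  # [B, 512]
```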

  • Training‑Time Latentizer
    - Deterministic clause/phrase segmentation (~17 words → 1 concept on average).
    - Pack sequences up to L; prepend <BOS>, append <EOS>, pad with <PAD>.
    - Generate two stochastic views per sequence for contrastive learning (paraphrase/noise/segment jitter).
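A deterministic segmenter of this kind can be sketched as a punctuation split followed by greedy merging toward ~17 words per concept (function name and split rules are illustrative):

```python
import re

def segment(text, target_words=17):
    """Split on clause punctuation, then greedily merge clauses until each
    concept reaches roughly target_words words (~17 by default, per section 7)."""
    clauses = [c.strip() for c in re.split(r"[.;:,!?]+", text) if c.strip()]
    concepts, current = [], []
    for clause in clauses:
        current.extend(clause.split())
        if len(current) >= target_words:
            concepts.append(" ".join(current))
            current = []
    if current:                       # flush the tail as a final concept
        concepts.append(" ".join(current))
    return concepts
```

Because the split is rule-based and seedless, re-running it on the same corpus reproduces the same concept boundaries, which supports the reproducibility requirement in section 5.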

  • Backbone & Heads
    - Replace the token embedding with bridge(768→512).
    - Keep Mamba/Transformer blocks; minimal positional scheme (index‑based/rotary).
    - Heads: (a) continuous next‑latent prediction; (b) optional multi‑horizon.

  • Decoding & OOV
    - ANN cosine over the concept store; if top‑1 cosine < τ, call vec2text; upsert (text, vector, meta) back into the store.

  • Evaluation
    - Text: BLEU‑4, ROUGE‑L between Expected Text and Output Text.
    - Vector: cosine(GTR‑T5(Expected Text), ẑ) and angular error.


    5) Non‑Functional Requirements

  • Stability: No collapse; rising InfoNCE accuracy; healthy batch variance.
  • Latency: p95 budget set and tracked with/without fallback.
  • Scale: Sharded ANN; online inserts ≤ 50 ms p95; read‑after‑write ≤ 1 s.
  • Reproducibility: Seeded segmentation; pinned model versions; manifest files.
  • Safety: Bounded generation length; deterministic fallback; observability.

    6) Architecture

  • Input: Text → (optional) GTR‑T5 → 768D vectors → Latentizer packs sequences.
  • Core: Bridge 768→512 → Hybrid Mamba/Transformer → Heads predict next latent(s).
  • Output: 768D vector → ANN cosine decode → text; vec2text fallback if needed → store insert.
  • Data Flow (training): Expected Text → GTR‑T5 → z* (target vector) and concept sequence → model → z^ → metrics.

    7) Data Strategy

  • Datasets: Use the same corpora as the reference model where allowed; ensure doc‑level split (80/10/10).
  • Segmentation: Punctuation + heuristics; optional small learned splitter; enforce target granularity (~17 words/concept).
  • Packing: Sliding windows with overlap; maintain doc boundaries for leakage control.
  • Normalization: Unit‑norm on all vectors pre/post bridge; mean‑free if needed.
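The packing step can be sketched as a per-document sliding window (the overlap value is illustrative; calling it per document keeps windows from crossing doc boundaries, as the leakage-control requirement demands):

```python
def pack(concepts, L, overlap=128):
    """Sliding windows of up to L concepts with fixed overlap.
    Called once per document, so windows never cross doc boundaries."""
    step = max(1, L - overlap)
    return [concepts[i:i + L]
            for i in range(0, max(1, len(concepts) - overlap), step)]
```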

    8) Model Specification

  • Bridge: Linear(768→512) + LayerNorm + GELU (optional) + residual projection path.
  • Positional: Learned index bias or rotary on latent index.
  • Heads:
    - AR next‑step: predict x_{t+1} (512D), then project to 768D for metrics/decoding.
    - Multi‑horizon: additional heads for t+k, k ∈ {4, 16}.


    9) Objectives & Losses (Anti‑Collapse)

  • Autoregressive (AR): L_AR = α·(1 − cos(x̂, x)) + (1 − α)·‖x̂ − x‖₂², with α = 0.7.
  • Masked Latent Modeling (MLM): 15–30% random masks; L2/cos reconstruction.
  • Contrastive (InfoNCE): Two views per sequence; temperature τ=0.07; large in‑batch negatives.
  • VICReg/Barlow Twins: Variance floor γ=1.0 after unit‑norm; decorrelate off‑diagonals.
  • Scheduled Sampling: Introduce 10–30% free‑running over epochs.
  • Default weights: λ_AR = 1.0, λ_MLM = 0.5, λ_NCE = 0.5, λ_VIC = 0.04.
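Two of the terms above are easy to make concrete. A NumPy sketch of the AR loss and a VICReg-style variance hinge (function names are this sketch's own, and real training would compute these on framework tensors):

```python
import numpy as np

def ar_loss(x_hat, x, alpha=0.7):
    """L_AR = alpha * (1 - cos(x_hat, x)) + (1 - alpha) * ||x_hat - x||^2."""
    cos = np.sum(x_hat * x, axis=-1) / (
        np.linalg.norm(x_hat, axis=-1) * np.linalg.norm(x, axis=-1))
    l2 = np.sum((x_hat - x) ** 2, axis=-1)
    return float(np.mean(alpha * (1.0 - cos) + (1.0 - alpha) * l2))

def variance_penalty(batch, gamma=1.0):
    """VICReg-style hinge: penalize dims whose batch stddev falls below gamma."""
    return float(np.mean(np.maximum(0.0, gamma - batch.std(axis=0))))
```

A perfectly collapsed batch (all rows identical) pays the full per-dim penalty gamma, which is what makes this term an effective collapse monitor as well as a loss.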

    10) Training Protocol

  • Curriculum on L: 1k → 2k → 4k → 8k (advance only after stability checkpoints).
  • Phases:
    - A: MLM + denoise @ L=1k
    - B: +AR @ L=2k
    - C: +InfoNCE + multi‑horizon @ L=4k
    - D: scheduled sampling; scale to L=8k

  • Batching: Target global batch ≈ 1000 sequences via grad accumulation.
  • Regularization: Dropout 0.1; stochastic depth 0.1–0.2; grad clip 1.0; EMA teacher for contrastive target.
  • Precision: BF16/FP16 mix; activation checkpointing; ZeRO/sharding as needed.
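The phase curriculum above can be captured as a small table plus an advance rule (names are illustrative; "stability_ok" stands for whatever the section 12 no-collapse checks report):

```python
# (phase, context length L, active objectives) from the schedule above
PHASES = [
    ("A", 1_000, {"mlm", "denoise"}),
    ("B", 2_000, {"mlm", "denoise", "ar"}),
    ("C", 4_000, {"mlm", "denoise", "ar", "infonce", "multi_horizon"}),
    ("D", 8_000, {"mlm", "denoise", "ar", "infonce", "multi_horizon",
                  "scheduled_sampling"}),
]

def next_phase(idx, stability_ok):
    """Advance to the next phase only after a passing stability checkpoint."""
    return min(idx + 1, len(PHASES) - 1) if stability_ok else idx
```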

    11) Inference & Decoding

  • Step Output: 768D unit‑norm vector.
  • ANN Decode: FAISS/HNSW/ScaNN; top‑K = 32; fusion score s = λ·cos + (1 − λ)·step_logit, λ = 0.7.
  • Threshold τ: Start at 0.85; if top‑1 < τ → vec2text; cache and upsert new pair.
  • Caching: Per‑session LRU; prefetch neighbors around prior steps.
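A sketch of the decode step combining the fusion score and the threshold (dict shapes mirror the section 13 query API; the exact re-rank ordering is this sketch's interpretation):

```python
def fuse(cosine, step_logit, lam=0.7):
    """Fusion score s = lam * cos + (1 - lam) * step_logit."""
    return lam * cosine + (1.0 - lam) * step_logit

def decode_step(neighbors, step_logits, tau=0.85):
    """neighbors: top-K ANN hits sorted by cosine; step_logits: model scores.
    Falls back to vec2text when the top-1 cosine misses the threshold."""
    if not neighbors or neighbors[0]["cosine"] < tau:
        return None, "vec2text"     # caller runs vec2text, then upserts
    best, _ = max(zip(neighbors, step_logits),
                  key=lambda pair: fuse(pair[0]["cosine"], pair[1]))
    return best["text"], "ann"
```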

    12) Evaluation (Acceptance Gates)

  • Text: BLEU‑4, ROUGE‑L between Expected Text and Output Text.
  • Vector: cosine(z*, ẑ) and angular error; z* = GTR‑T5(Expected Text).
  • Targets (v1): cosine avg ≥ 0.85 (p90 ≥ 0.80); BLEU‑4 ≥ 0.25; ROUGE‑L ≥ 0.45; fallback ≤ 10% and trending down.
  • No‑collapse: Per‑dim batch stddev ≥ γ; InfoNCE retrieval ↑; stable losses.
  • Reporting: Mean/median/std/95% CI; p50/p95 latency; OOV trend; domain slices.

    13) Concept Store (ANN) Requirements

  • API: query(vector,K) → {id, cosine, text}; upsert(id?, vector, text, meta); get(id).
  • Index: Cosine; unit‑norm; HNSW for high recall or IVF‑PQ for memory tradeoffs.
  • Scale: Sharded by id/time; background reindex.
  • Consistency: Read‑after‑write ≤ 1 s.
  • Observability: Hit‑rate, τ‑failures, tail latency, growth, dedup stats.
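The API contract above can be exercised against a brute-force in-memory stand-in (production would use a sharded FAISS/HNSW index; the class name is hypothetical):

```python
import numpy as np

class ConceptStore:
    """In-memory cosine store implementing query/upsert/get from the spec."""
    def __init__(self):
        self._vecs, self._texts, self._meta = [], [], []

    def upsert(self, vector, text, meta=None):
        v = np.asarray(vector, dtype=float)
        self._vecs.append(v / np.linalg.norm(v))   # unit-norm on insert
        self._texts.append(text)
        self._meta.append(meta)
        return len(self._texts) - 1                # new id

    def get(self, idx):
        return {"id": idx, "text": self._texts[idx], "meta": self._meta[idx]}

    def query(self, vector, K=32):
        q = np.asarray(vector, dtype=float)
        sims = np.stack(self._vecs) @ (q / np.linalg.norm(q))
        order = np.argsort(-sims)[:K]
        return [{"id": int(i), "cosine": float(sims[i]),
                 "text": self._texts[i]} for i in order]
```

Because both sides are unit-normed, the dot product in `query` is exactly the cosine score the decode path thresholds against.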

    14) Configuration Defaults

    Parameter        | Default            | Notes
    Input dim        | 768                | GTR‑T5 compatible
    Bridge dim       | 512                | compressive bottleneck
    Context L        | 1k → 2k → 4k → 8k  | curriculum
    Global batch     | ~1000 sequences    | via grad accumulation
    Mask ratio       | 0.2                | MLM
    InfoNCE τ        | 0.07               | temperature
    Cos threshold τ  | 0.85               | fallback trigger
    ANN top‑K        | 32                 | re‑rank window
    Fusion λ         | 0.7                | cos vs. logit
    Dropout          | 0.1                |
    Stoch. depth     | 0.1–0.2            |
    Grad clip        | 1.0                |
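These defaults can be pinned in a single config object for reproducible runs (field names are this sketch's own):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Defaults:
    """Configuration defaults from section 14."""
    input_dim: int = 768                      # GTR-T5 compatible
    bridge_dim: int = 512                     # compressive bottleneck
    context_curriculum: tuple = (1_000, 2_000, 4_000, 8_000)
    global_batch: int = 1_000                 # sequences, via grad accumulation
    mask_ratio: float = 0.2                   # MLM
    infonce_tau: float = 0.07                 # temperature
    cos_threshold: float = 0.85               # fallback trigger
    ann_top_k: int = 32                       # re-rank window
    fusion_lambda: float = 0.7                # cos vs. logit
    dropout: float = 0.1
    stochastic_depth: tuple = (0.1, 0.2)
    grad_clip: float = 1.0
```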

    15) APIs (Sketch)

    Training Latentizer

  • POST /latentize → {sequence_id, latents:[vec768...], meta}

    Inference

  • POST /step → {vec_in_768} → {vec_out_768, score}
  • POST /decode → {vec_out_768} → {text, cosine, source}

    Concept Store

  • POST /ann/query {vector, K} → {items:[{id, cosine, text}]}
  • POST /ann/upsert {id?, vector, text, meta} → {id}

    16) Implementation Plan

  • Week 1: Latentizer v0; bridge; control vectors; tiny model @ L=1k trains with MLM+denoise.
  • Week 2: Add AR; cosine ≥ 0.80 on small eval; BLEU‑4 ≥ 0.20 / ROUGE‑L ≥ 0.40.
  • Week 3: Add InfoNCE + k‑head; L=2k/4k; integrate ANN decode (no fallback), dashboards.
  • Week 4: Add vec2text fallback + online inserts; L=4k; fallback ≤ 20%.
  • Week 5–6: Scale to L=8k; meet v1 gates; harden infra & docs; freeze test set.
  • Deliverables: Code modules (latentizer/bridge/model/heads/eval), ANN service, eval suite, runbooks, manifests.

    17) Risks & Mitigations

    Risk                      | Impact             | Mitigation
    Representation collapse   | Training fails     | VICReg variance floor; InfoNCE w/ many negatives; multi‑horizon; EMA teacher; monitor stats
    Long‑context instability  | Convergence issues | Curriculum; LR schedule; checkpointing; grad clip
    High fallback rate        | Latency & cost     | τ tuning; cache; online inserts; grow store; re‑rank top‑K
    Mis‑alignment 512↔768     | Metric drop        | Careful projections; unit‑norm; angular‑error monitoring
    ANN drift/dupes           | Decode errors      | Cosine + text‑hash dedup; periodic compaction; QA

    18) Acceptance Criteria (Go/No‑Go)

  • Meets v1 targets (§12) on frozen test set.
  • Fallback ≤ 10% and dropping; p95 latency within budget.
  • No collapse indicators; stable training at L=8k.
  • Reproducible runs with seeded segmentation and pinned versions.

    19) Extensions (Post‑v1)

  • VQ discrete latents (codebook entropy/usage regularizers).
  • Differentiable segmenter; Sequential‑GPS positions; TMD channelization.
  • Reranker over top‑K neighbors using internal scores.

    20) Appendix

    A. Metric Computation

  • Cosine: cos(z*, ẑ) = (z* · ẑ) / (‖z*‖ ‖ẑ‖); report angular error arccos(cos(z*, ẑ)).
  • Text: sacreBLEU (BLEU‑4), ROUGE‑L.
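The vector metric in code (NumPy; function name illustrative):

```python
import numpy as np

def cosine_and_angular_error(z_star, z_hat):
    """Return (cosine similarity, angular error in radians)."""
    c = float(np.dot(z_star, z_hat) /
              (np.linalg.norm(z_star) * np.linalg.norm(z_hat)))
    return c, float(np.arccos(np.clip(c, -1.0, 1.0)))
```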
    B. Minimal Pseudocode (Training)

    x768 = latentizer(text_batch)        # [B, L, 768], unit-norm
    x = bridge(layernorm(x768))          # [B, L, 512]
    y = model(x)                         # predicts next and k-step latents
    loss = L_ar + L_mlm + L_infonce + L_vic
    loss.backward()
    clip_grad_norm_(params, 1.0)
    opt.step(); opt.zero_grad()

    C. Decoding Logic

    neighbors = ann.query(vec_out_768, topk=32)
    if neighbors[0].cosine >= tau:
        text = neighbors[0].text
    else:
        text = vec2text(vec_out_768)
        ann.upsert(vector=vec_out_768, text=text, meta=...)


    End of PRD v1.0
