Data Curation Options & Resource Requirements
Trent Carter
8/5/2025
Curation Strategy Options
| Strategy | Quality | Scale (concepts) | CPU Hours | GPU Hours | Internet Queries | Processing Time | Cost Estimate |
|---|---|---|---|---|---|---|---|
| Manual + LLM Validation | Excellent | 10M max | 2,000 | 200 | 50K | 3-4 weeks | $500-1K |
| Automated + Two-Stage LLM | Very Good | 100M | 500 | 100 | 500K | 1-2 weeks | $200-500 |
| Hybrid: Seeds + Expansion | High | 1B | 1,000 | 300 | 1M | 2-3 weeks | $800-1.5K |
| Pure Automated Pipeline | Good | 1B+ | 200 | 50 | 10M | 3-5 days | $100-300 |
| Crystallization Method | Very High | 500M | 800 | 400 | 2M | 1.5-2 weeks | $1K-2K |
Data Sources & Yields
| Source | Raw Size | Concept Yield | Quality | Extraction Rate | Special Requirements |
|---|---|---|---|---|---|
| The Stack v2 | 67.5TB | 50M code concepts | High | 1M/day | AST parsing, sandboxing |
| C4 (Cleaned) | 750GB | 200M text concepts | Medium | 5M/day | Language detection |
| Wikipedia | 100GB | 30M concepts | High | 2M/day | Entity linking |
| arXiv Papers | 500GB | 20M scientific | Very High | 500K/day | PDF parsing, citations |
| ConceptNet 5.7 | 1GB | 8M relations | Excellent | 8M/day | Already structured |
| Wikidata | 120GB | 100M entities | Excellent | 10M/day | SPARQL queries |
| CodeContests | 2GB | 50K code patterns | Excellent | 50K/day | Self-validating |
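Dividing each source's concept yield by its extraction rate gives a rough per-source extraction time. The sketch below does that arithmetic with the figures from the table; it assumes a constant daily rate and a single extraction worker per source, neither of which the table states.

```python
# Concept yields and extraction rates copied from the table above.
# Assumption: the rate is constant and one worker handles each source.
SOURCES = {
    "The Stack v2": (50_000_000, 1_000_000),
    "C4 (Cleaned)": (200_000_000, 5_000_000),
    "Wikipedia": (30_000_000, 2_000_000),
    "arXiv Papers": (20_000_000, 500_000),
    "ConceptNet 5.7": (8_000_000, 8_000_000),
    "Wikidata": (100_000_000, 10_000_000),
    "CodeContests": (50_000, 50_000),
}


def extraction_days(yield_count, rate_per_day):
    """Days needed to extract a source's full yield at a fixed daily rate."""
    return yield_count / rate_per_day


days = {name: extraction_days(y, r) for name, (y, r) in SOURCES.items()}

# Print the slowest sources first.
for name, d in sorted(days.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {d:5.0f} days")
```

Under these assumptions The Stack v2 takes 50 days and C4 and arXiv take 40 days each, so extracting a source in full would likely need parallel workers or a subset to fit a multi-week schedule.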
Resource Scaling by Concept Count
| Concepts | Storage (Parquet) | RAM (Loading) | FAISS Index | Training RAM | Inference RAM | Build Time |
|---|---|---|---|---|---|---|
| 10K | 50MB | 100MB | 10MB | 2GB | 500MB | 1 hour |
| 100K | 500MB | 1GB | 100MB | 4GB | 1GB | 4 hours |
| 1M | 5GB | 10GB | 1GB | 8GB | 2GB | 12 hours |
| 10M | 50GB | 50GB | 10GB | 20GB | 4GB | 2 days |
| 100M | 500GB | 200GB | 100GB | 40GB | 8GB | 1 week |
| 1B | 5TB | 1TB | 1TB | 80GB | 16GB | 2-3 weeks |
*Requires memory mapping and batch loading where the dataset exceeds available RAM
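One way to picture memory mapping with batch loading is the minimal sketch below, which uses only the Python standard library. It assumes embeddings are stored as a flat file of little-endian float32 rows; the file name, the helper names `write_rows` and `iter_batches`, and the placeholder values are all hypothetical. A real pipeline would more likely map Parquet or array files through a dedicated library.

```python
import mmap
import os
import struct
import tempfile

DIM = 768                 # embedding width, matching the 768D schema column
ROW_BYTES = DIM * 4       # assumption: float32, 4 bytes per component
ROW_FORMAT = f"<{DIM}f"   # one little-endian float32 row


def write_rows(path, n_rows):
    """Write n_rows placeholder embeddings; row i is filled with float(i)."""
    with open(path, "wb") as f:
        for i in range(n_rows):
            f.write(struct.pack(ROW_FORMAT, *([float(i)] * DIM)))


def iter_batches(path, batch_rows):
    """Yield batches of rows from a memory-mapped file.

    The OS pages in only the byte ranges that are touched, so the whole
    file never has to fit in RAM at once.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_rows = len(mm) // ROW_BYTES
        for start in range(0, n_rows, batch_rows):
            stop = min(start + batch_rows, n_rows)
            yield [struct.unpack_from(ROW_FORMAT, mm, row * ROW_BYTES)
                   for row in range(start, stop)]


path = os.path.join(tempfile.mkdtemp(), "embeddings.f32")
write_rows(path, n_rows=1000)
batch_sizes = [len(batch) for batch in iter_batches(path, batch_rows=256)]
print(batch_sizes)
```

The batch size controls peak memory: each batch holds at most `batch_rows` rows, regardless of how large the file on disk grows.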
Storage Schema Breakdown (per concept)
| Component | 768D (bytes) | 1536D (bytes) | Multi-Dim (bytes) |
|---|---|---|---|
| Concept ID | 16 | 16 | 16 |
| Text | 200 (avg) | 200 | 200 |
| Embeddings | 3,072 | 6,144 | 9,216 |
| Projections | 0 | 0 | 4,608 |
| Relations | 100 (avg) | 100 | 100 |
| Metadata | 150 | 150 | 150 |
| Domain Info | 50 | 50 | 50 |
| Total per concept | ~3.6KB | ~6.7KB | ~14.3KB |
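The totals above can be reproduced with a few lines of arithmetic. The sketch below assumes 4-byte float32 components, which matches the embedding sizes in the table (768 × 4 = 3,072), and assumes the Multi-Dim column stores both the 768D and 1536D embeddings plus the 384D and 768D projected vectors. The function names are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: float32 storage, consistent with the table

# Fixed-size (average) fields per concept, in bytes, from the table above.
FIXED_FIELDS = {
    "concept_id": 16,
    "text": 200,
    "relations": 100,
    "metadata": 150,
    "domain_info": 50,
}


def bytes_per_concept(embedding_dims, projection_dims=()):
    """Sum the fixed fields plus float32 embeddings and projections."""
    embeddings = sum(embedding_dims) * BYTES_PER_FLOAT32
    projections = sum(projection_dims) * BYTES_PER_FLOAT32
    return sum(FIXED_FIELDS.values()) + embeddings + projections


def raw_dataset_gb(n_concepts, per_concept_bytes):
    """Uncompressed size in GB (10^9 bytes), before any file-format overhead."""
    return n_concepts * per_concept_bytes / 1e9


single_768 = bytes_per_concept([768])
single_1536 = bytes_per_concept([1536])
multi_dim = bytes_per_concept([768, 1536], projection_dims=[384, 768])

print(single_768, single_1536, multi_dim)
print(raw_dataset_gb(10_000_000, single_768))
```

This gives 3,588, 6,660, and 14,340 bytes, which round to the ~3.6KB, ~6.7KB, and ~14.3KB totals in the table.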
Recommended Phased Approach
Phase 1: Proof of Concept (10K concepts)
Sources: CodeContests + Wikipedia sample + ConceptNet subset
Time: 2-3 days
Resources: 1 CPU, local processing
Storage: 50MB
Purpose: Validate pipeline, test both VMM and Diffusion
Phase 2: Development Dataset (100K concepts)
Sources: The Stack (Python subset) + arXiv abstracts + ConceptNet
Time: 1 week
Resources: M4 Mac full utilization
Storage: 500MB
Purpose: Full model training, architecture comparison
Phase 3: Production Dataset (10M concepts)
Sources: Multi-language code + scientific papers + knowledge graphs
Time: 2 weeks
Resources: M4 Mac + some cloud processing
Storage: 50GB
Purpose: Real-world performance validation
Phase 4: Full Scale (100M+ concepts)
Sources: Full Stack v2 + C4 subset + all knowledge sources
Time: 3-4 weeks
Resources: Hybrid local/cloud
Storage: 500GB-5TB
Purpose: Production-ready model
Critical "Get It Right First Time" Considerations
1. Schema Design
Multi-dimensional embeddings: Store all dimensions from day 1
Projection matrices: Learn 768→384 and 1536→768 mappings
Versioning: Track embedding model versions for reproducibility
Temporal stamps: Enable training data evolution analysis
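A record type along the following lines would carry all four schema items. This is a minimal sketch, not a prescribed layout: the class name `ConceptRecord`, its field names, and the version string are hypothetical, and it assumes projections are stored as projected vectors keyed by (source dimension, target dimension).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConceptRecord:
    concept_id: str
    text: str
    # All embedding widths stored together, keyed by dimension.
    embeddings: dict
    # Projected vectors keyed by (source_dim, target_dim), e.g. (768, 384).
    projections: dict
    # Which embedding model produced the vectors, for reproducibility.
    embedding_model_version: str
    # UTC creation time, to support analysis of how the data evolves.
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def validate(self):
        """Raise ValueError if any vector length disagrees with its key."""
        for dim, vec in self.embeddings.items():
            if len(vec) != dim:
                raise ValueError(f"embedding {dim} has length {len(vec)}")
        for (src, dst), vec in self.projections.items():
            if src not in self.embeddings:
                raise ValueError(f"projection source {src} is not stored")
            if len(vec) != dst:
                raise ValueError(f"projection {src}->{dst} has length {len(vec)}")


record = ConceptRecord(
    concept_id="c-0001",
    text="binary search",
    embeddings={768: [0.0] * 768, 1536: [0.0] * 1536},
    projections={(768, 384): [0.0] * 384, (1536, 768): [0.0] * 768},
    embedding_model_version="hypothetical-encoder-v1",
)
record.validate()
```

Validating at write time keeps a mismatched dimension from reaching the index, where it would be harder to trace.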
2. Quality Assurance Pipeline
Validation scores: Every concept gets multiple validation passes
Source tracking: Maintain provenance for debugging
Relationship confidence: Weight connections by validation strength
Domain confidence: Track classification uncertainty
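The four quality signals could be attached to each concept roughly as sketched here. The scoring rules are assumptions made for illustration, not the document's method: quality is the mean of the validation passes, a relation is weighted by its own score times the weaker endpoint's quality, and domain uncertainty is normalized Shannon entropy over the classifier's domain probabilities.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class ValidatedConcept:
    concept_id: str
    source: str              # provenance, kept for debugging
    validation_scores: list  # one score in [0, 1] per validation pass
    domain_probs: dict       # domain name -> classifier probability

    @property
    def quality(self):
        """Mean score across validation passes (assumed aggregation rule)."""
        return mean(self.validation_scores)

    @property
    def domain_uncertainty(self):
        """Normalized entropy: 0.0 is certain, 1.0 is maximally unsure."""
        n = len(self.domain_probs)
        if n < 2:
            return 0.0
        entropy = -sum(p * math.log(p)
                       for p in self.domain_probs.values() if p > 0)
        return entropy / math.log(n)


def relation_weight(src, dst, relation_score):
    """Scale a relation's score by its weaker endpoint (assumed rule)."""
    return relation_score * min(src.quality, dst.quality)


a = ValidatedConcept("c1", "wikipedia", [0.8, 0.9, 1.0],
                     {"physics": 0.5, "biology": 0.5})
b = ValidatedConcept("c2", "arxiv", [0.5, 0.5],
                     {"physics": 1.0, "biology": 0.0})
print(a.quality, a.domain_uncertainty, relation_weight(a, b, 0.8))
```

Keeping `source` on every record means a low-quality cluster can be traced back to the extractor that produced it.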
3. Scalability Preparation
Sharded storage: Design for horizontal scaling from start
Incremental updates: Support adding new concepts without rebuild
Memory mapping: Handle datasets larger than RAM
Distributed indexing: FAISS clustering for massive scales
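Sharded storage with incremental updates can be sketched as a stable hash from concept ID to shard number. The example below is an in-memory stand-in under that assumption; `shard_for` and `ShardedStore` are hypothetical names, and a real system would write each shard to its own file or index.

```python
import hashlib
from collections import defaultdict


def shard_for(concept_id, num_shards):
    """Map a concept ID to a stable shard number in [0, num_shards)."""
    digest = hashlib.sha256(concept_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """In-memory stand-in for a set of shard files (hypothetical helper)."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = defaultdict(dict)

    def add(self, concept_id, record):
        """Insert one concept; only its own shard is touched."""
        shard = shard_for(concept_id, self.num_shards)
        self.shards[shard][concept_id] = record
        return shard

    def get(self, concept_id):
        shard = shard_for(concept_id, self.num_shards)
        return self.shards[shard].get(concept_id)


store = ShardedStore(num_shards=8)
for i in range(1000):
    store.add(f"concept-{i}", {"text": f"placeholder {i}"})

# An incremental update: no existing shard is rebuilt.
new_shard = store.add("concept-new", {"text": "added later"})
print(new_shard, {s: len(rows) for s, rows in sorted(store.shards.items())})
```

Plain modulo hashing remaps most IDs if the shard count changes, so the shard count would need to be fixed up front or replaced with a consistent-hashing scheme.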
4. Testing Integration
Held-out validation: 10% of each domain reserved for testing
Temporal splits: Early concepts for training, later for validation
Cross-domain evaluation: Test physics expert on biology concepts
Adversarial samples: Include deliberately challenging cases
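The held-out and temporal splits can be sketched as follows. The record layout (dicts with `domain` and `created_at` keys), the function names, and the rule of reserving at least one item per domain are assumptions made for illustration.

```python
import random
from collections import defaultdict


def holdout_by_domain(concepts, test_fraction=0.10, seed=0):
    """Reserve a fixed fraction of every domain for testing."""
    by_domain = defaultdict(list)
    for concept in concepts:
        by_domain[concept["domain"]].append(concept)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for domain in sorted(by_domain):
        items = list(by_domain[domain])
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def temporal_split(concepts, cutoff):
    """Earlier concepts go to training, later ones to validation."""
    train = [c for c in concepts if c["created_at"] < cutoff]
    validation = [c for c in concepts if c["created_at"] >= cutoff]
    return train, validation


# Placeholder data: created_at is an integer ordinal, not a real timestamp.
concepts = (
    [{"id": f"p{i}", "domain": "physics", "created_at": i} for i in range(100)]
    + [{"id": f"b{i}", "domain": "biology", "created_at": i} for i in range(50)]
)

train, test = holdout_by_domain(concepts)
early, late = temporal_split(concepts, cutoff=80)
print(len(train), len(test), len(early), len(late))
```

Splitting per domain rather than over the pooled data keeps small domains represented in the test set, which the cross-domain evaluation depends on.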