Data Curation Sources

8/7/2025


1. High-Quality Seed Datasets with Natural Relationships

| Dataset | Relationship Types | Scale | Quality |
|---|---|---|---|
| ConceptNet 5.7 | IsA, PartOf, UsedFor, RelatedTo, HasContext | 8M edges | High - human curated |
| Wikidata | P31 (instance of), P279 (subclass of), P361 (part of) | 100M+ items | Very High - structured |
| WordNet | Hypernym, Hyponym, Meronym, Holonym | 155K words, 117K synsets | Excellent - linguistic gold standard |
| ATOMIC 2020 | Causes, Effects, Intents, Reactions | 1.33M inferences | High - commonsense reasoning |
| Visual Genome | Spatial, Attribute, Action relationships | 2.3M relationships (3.8M object instances) | Good - grounded in images |
| SciGraph | Citations, Methods, Results, Hypotheses | 15M papers | Domain-specific excellence |
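The seed datasets name similar relationships differently (for example ConceptNet `IsA`, Wikidata `P279`, WordNet `Hypernym`). A minimal sketch of folding them into one vocabulary might look like the following. The shared names `is_a`, `instance_of`, and `part_of`, the `ALIASES` mapping, and the `normalize` helper are all illustrative assumptions, not an official alignment between these resources.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical mapping: (source, label) -> (shared relation, swap head/tail?).
# Assumed convention: (h, "Hypernym", t) reads "t is a hypernym of h",
# and likewise for Hyponym, Holonym, and Meronym.
ALIASES = {
    ("conceptnet", "IsA"): ("is_a", False),
    ("conceptnet", "PartOf"): ("part_of", False),
    ("wikidata", "P31"): ("instance_of", False),
    ("wikidata", "P279"): ("is_a", False),
    ("wikidata", "P361"): ("part_of", False),
    ("wordnet", "Hypernym"): ("is_a", False),
    ("wordnet", "Hyponym"): ("is_a", True),
    ("wordnet", "Holonym"): ("part_of", False),
    ("wordnet", "Meronym"): ("part_of", True),
}


def normalize(source: str, triple: Triple) -> Optional[Triple]:
    """Map a dataset-specific triple to the shared vocabulary, or None if unmapped."""
    entry = ALIASES.get((source, triple.relation))
    if entry is None:
        return None
    relation, swap = entry
    head, tail = (triple.tail, triple.head) if swap else (triple.head, triple.tail)
    return Triple(head, relation, tail)


print(normalize("wordnet", Triple("car", "Meronym", "wheel")))
```

Inverse labels such as `Meronym` are flipped so that every normalized edge points the same way, which keeps later graph queries simpler.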

Code-Focused Datasets for Concept Training

| Dataset | Size | Quality Features | Concept Extraction Value |
|---|---|---|---|
| The Stack v2 | 67.5TB, 600+ languages | Permissively licensed, deduplicated | Massive scale, multi-paradigm |
| CodeParrot | 50GB Python | Clean, well-documented | Pure Python focus |
| CodeContests | 13k problems | Solutions + test cases | Self-validating logic |
| APPS | 10k problems | Difficulty levels, test suites | Progressive complexity |
| HumanEval-X | 164 problems × 5 languages (820 samples) | Hand-written tests | Cross-lingual concepts |
| MBPP | 1000 Python tasks | Natural language → code | Concept bridging |
| CodeXGLUE | 14 datasets across 10 tasks | Understanding + generation | Semantic code relationships |
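Several of these corpora pair natural language with code (docstrings, task descriptions). As a rough illustration of that "concept bridging" idea, the sketch below pulls (name, docstring, source) triples out of Python source using the standard `ast` module. The function name `docstring_code_pairs` and the sample snippet are hypothetical, and none of the datasets above is claimed to be built this way.

```python
import ast


def docstring_code_pairs(source: str):
    """Return (function name, docstring, function source) for documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip functions without a docstring
                pairs.append((node.name, doc, ast.get_source_segment(source, node)))
    return pairs


# Hypothetical input: one documented and one undocumented function.
sample = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
    "    return a + b\n"
    "\n"
    "def helper(x):\n"
    "    return x\n"
)

for name, doc, code in docstring_code_pairs(sample):
    print(name, "->", doc)
```

Only `add` is reported here, because `helper` has no docstring to pair with its body.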


Code-Specific Concept Relations

```python
CODE_SPECIFIC_RELATIONS = [
    "implements",   # Function implements algorithm
    "optimizes",    # Better version of another approach
    "generalizes",  # More general version
    "specializes",  # More specific version
    "tests",        # Test case for concept
    "depends_on",   # Requires other concept
    "parallel_to",  # Can run concurrently with
    "inverse_of",   # Undo operation
    "composed_of",  # Built from smaller concepts
]
```
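One way to put this relation list to work is as the label set for typed edges in a small concept graph. The sketch below is only an illustration: `ConceptEdge`, `ConceptGraph`, the example concept names, and the choice to treat `parallel_to` and `inverse_of` as symmetric are all assumptions rather than part of the original design.

```python
from collections import defaultdict
from dataclasses import dataclass

CODE_SPECIFIC_RELATIONS = [
    "implements", "optimizes", "generalizes", "specializes", "tests",
    "depends_on", "parallel_to", "inverse_of", "composed_of",
]

# Assumption: these two relations hold in both directions.
SYMMETRIC_RELATIONS = {"parallel_to", "inverse_of"}


@dataclass(frozen=True)
class ConceptEdge:
    source: str
    relation: str
    target: str

    def __post_init__(self):
        # Reject labels outside the agreed relation vocabulary.
        if self.relation not in CODE_SPECIFIC_RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")


class ConceptGraph:
    def __init__(self):
        self._out = defaultdict(set)  # source concept -> outgoing edges

    def add(self, source, relation, target):
        self._out[source].add(ConceptEdge(source, relation, target))
        if relation in SYMMETRIC_RELATIONS:
            # Store the mirrored edge so lookups work from either side.
            self._out[target].add(ConceptEdge(target, relation, source))

    def neighbors(self, source, relation):
        edges = self._out.get(source, set())
        return sorted(e.target for e in edges if e.relation == relation)


graph = ConceptGraph()
graph.add("quicksort", "implements", "sorting")
graph.add("quicksort", "depends_on", "partition")
graph.add("push", "inverse_of", "pop")
print(graph.neighbors("pop", "inverse_of"))
```

Validating the label at construction time means a typo such as `"depend_on"` fails immediately instead of silently creating a new relation type.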
