🧩 Propositional vs Semantic Chunking
| Metric / Feature | Propositional Chunking | Semantic Chunking |
| --- | --- | --- |
| Definition | Splits text into logical propositions (e.g., clauses and other minimal statement units). | Splits text based on semantic similarity using embeddings (e.g., sentence transformers). |
| Granularity | Fine-grained: clause-level or logical-unit segmentation. | Variable: sentence-level or multi-sentence chunks based on meaning. |
| Context Preservation | Moderate: may lose broader context across clauses. | High: preserves semantic coherence across chunks. |
| Chunk Size Variability | Low to moderate (clauses are often short). | High (semantic similarity may group multiple sentences). |
| Computational Cost | Moderate: requires syntactic parsing. | High: requires embedding generation and clustering. |
| Speed | 🟡 Medium | 🔴 Slower |
| Accuracy in Retrieval Tasks | Good for logic-heavy or entailment tasks. | Excellent for semantic search and RAG. |
| Robustness Across Domains | Limited: depends on grammar and sentence structure. | High: adapts well to diverse domains. |
| Overlap Support | Rarely used. | Common: overlapping windows improve recall. |
| Noise Sensitivity | High: punctuation or grammar errors degrade parsing. | Moderate: embeddings are more tolerant of noise. |
| Best Use Cases | Reasoning, entailment, legal texts, logic-heavy documents. | Semantic search, RAG, summarization, clustering. |
| Example Input | “Although it was raining, we went outside.” → [“Although it was raining”, “we went outside”] | “We went outside despite the rain.” → grouped with similar weather-related sentences. |
| Evaluation Metrics | Clause precision, syntactic coverage, logical completeness. | Embedding coherence, retrieval recall, semantic overlap. |

| Library | Propositional Chunking | Semantic Chunking |
| --- | --- | --- |
| spaCy | ✅ Dependency parsing for clause extraction. | ❌ Not designed for semantic chunking. |
| NLTK | ✅ Grammar-based chunking with regex and POS tags. | ❌ Limited semantic capabilities. |
| LangChain / LlamaIndex | ❌ Not built in, but can be customized. | ✅ Native support for semantic chunking via embeddings. |
| Transformers (Hugging Face) | ❌ Dependency parsing not supported out of the box. | ✅ Embedding models such as BGE, MiniLM, etc. |
| Chunking_RAG (GitHub) | 🟡 Experimental support via agentic chunkers. | ✅ Semantic chunking with nomic-embed-text, etc. |
🧪 Example Use Case: C4 Dataset
Let’s say you’re working with the first million samples of C4:

- **Propositional Chunking:** Use spaCy to parse each sentence and extract clauses. Ideal for logic-heavy QA or entailment tasks.
- **Semantic Chunking:** Use LangChain or LlamaIndex with BGE embeddings to group semantically similar sentences. Ideal for RAG or semantic search.
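To make the propositional side concrete, here is a minimal, dependency-free sketch that splits a sentence into clause-like units at commas and a few common conjunctions. This is a deliberately naive stand-in: a real pipeline would extract clauses from spaCy's dependency parse, and the function name `propositional_chunk` and the conjunction list are illustrative assumptions, not an established API.

```python
import re

# Clause boundaries: a comma, or a coordinating/subordinating conjunction
# surrounded by spaces. A naive stand-in for spaCy dependency parsing.
BOUNDARY = re.compile(r",\s+|\s+(?:and|but|so|because|while|whereas)\s+", re.I)

def propositional_chunk(sentence: str) -> list[str]:
    """Split one sentence into clause-level propositions."""
    parts = (p.strip(" ,.") for p in BOUNDARY.split(sentence))
    return [p for p in parts if p]

print(propositional_chunk("Although it was raining, we went outside."))
# ['Although it was raining', 'we went outside']
```

This reproduces the example row from the table above; anything beyond simple comma- and conjunction-delimited clauses (relative clauses, appositives) would need the full parser.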
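And a corresponding sketch of the semantic side: a greedy splitter that keeps appending sentences to the current chunk while cosine similarity stays above a threshold, and starts a new chunk when it drops. To stay self-contained it uses toy bag-of-words vectors; in practice you would replace `embed` with a real model (e.g., BGE via sentence-transformers, or LangChain's experimental `SemanticChunker`). The threshold value and helper names here are assumptions for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch runs stand-alone;
    # swap in a real embedding model (e.g., BGE) for actual use.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunk(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Greedily merge consecutive sentences while they remain similar
    to the running chunk; start a new chunk when similarity drops."""
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        if cosine(embed(" ".join(current)), embed(sent)) >= threshold:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The rain fell all day.",
    "The rain made the streets wet.",
    "Our model uses BGE embeddings.",
    "Embeddings capture meaning.",
]
print(semantic_chunk(sentences))
# ['The rain fell all day. The rain made the streets wet.',
#  'Our model uses BGE embeddings. Embeddings capture meaning.']
```

The weather sentences and the embedding sentences end up in separate chunks, which is the grouping behavior the table's "Example Input" row describes; the threshold is the main knob to tune per corpus.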