Multi-stage LLM orchestration for canonical theory mapping
Theory names are inconsistent across papers; many extracted "theories" are spurious, overly specific, or duplicates with different names.
A multi-stage LLM pipeline extracts, normalizes, validates, and clusters theory mentions, then maps them to a canonical ontology.
LLM validates whether each extracted mention is a genuine aging theory or a spurious extraction
RapidFuzz fuzzy string matching groups similar theory names (e.g., 'Free Radical Theory' ↔ 'Free Radical Theory of Aging')
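A minimal sketch of the name-grouping step, under stated assumptions: the greedy grouping strategy, the function name `group_similar_names`, and the 0.75 threshold are all hypothetical, not the pipeline's confirmed logic. To stay runnable without extra dependencies, it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer; a RapidFuzz scorer such as `rapidfuzz.fuzz.ratio` (which scores on a 0-100 scale) would take its place in the actual pipeline.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in for a RapidFuzz scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar_names(names, threshold=0.75):
    """Greedily place each name in the first group whose representative is similar enough."""
    groups = []  # each group is a list; its first element acts as the representative
    for name in names:
        for group in groups:
            if similarity(group[0], name) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no sufficiently similar group: start a new one
    return groups


# Illustrative mentions only; not data from the pipeline.
mentions = [
    "Free Radical Theory",
    "Free Radical Theory of Aging",
    "Telomere Shortening Theory",
    "free radical theory of ageing",
    "Telomere Shortening Theory of Aging",
]
groups = group_similar_names(mentions)
for group in groups:
    print(group)
```

Comparing only against a group's first member keeps the sketch short, but it makes the result depend on input order; the later LLM review stage is where over-merged or under-merged groups would be corrected.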
Map mentions to the list of known theories when the match is high-confidence, reducing ambiguity
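A hedged sketch of the known-theory mapping: normalize the mention, try an exact lookup, and fall back to a strict fuzzy match. The `KNOWN_THEORIES` table, the normalization rules, and the 0.9 cutoff are assumptions made for illustration; the real list and thresholds are not given here. `difflib.get_close_matches` again stands in for a RapidFuzz lookup.

```python
import difflib
import re

# Hypothetical sample of the known-theories list: normalized key -> canonical name.
KNOWN_THEORIES = {
    "free radical theory": "Free Radical Theory",
    "antagonistic pleiotropy theory": "Antagonistic Pleiotropy Theory",
    "disposable soma theory": "Disposable Soma Theory",
}


def normalize(name: str) -> str:
    """Lowercase, unify spelling, drop punctuation and a trailing 'of aging'."""
    text = name.lower().replace("ageing", "aging")
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return re.sub(r"\s+of aging$", "", text)


def map_to_known(mention: str, cutoff: float = 0.9):
    """Return the canonical name, or None when no high-confidence match exists."""
    key = normalize(mention)
    if key in KNOWN_THEORIES:  # exact match after normalization
        return KNOWN_THEORIES[key]
    close = difflib.get_close_matches(key, list(KNOWN_THEORIES), n=1, cutoff=cutoff)
    return KNOWN_THEORIES[close[0]] if close else None


print(map_to_known("Free-Radical Theory of Ageing"))
print(map_to_known("Antagonistic Pleiotropy Theroy"))  # typo, caught by the fuzzy fallback
print(map_to_known("Mitochondrial quantum drift hypothesis"))  # made-up name, no match
```

Returning `None` rather than the nearest low-scoring candidate leaves unmatched mentions for the later validity check instead of forcing them onto a known theory.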
Group theories by name similarity and mechanism overlap
LLM reviews groups, splits over-merged clusters, merges synonyms
Validate each canonical theory against aging theory definition and mechanism coherence
Embeddings + clustering + LLM refinement: assign small clusters to larger ones, split clusters with overly generic names
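One way the "assign small clusters to large" step could work, sketched under assumptions: each small cluster is merged into the large cluster with the most similar centroid, provided the cosine similarity clears a threshold. The function name `absorb_small_clusters`, the `min_size` and `min_sim` values, and the 3-dimensional toy vectors are hypothetical; real embeddings would come from an embedding model, and the LLM refinement step is not shown.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def absorb_small_clusters(clusters, min_size=3, min_sim=0.8):
    """Merge clusters smaller than min_size into the most similar large cluster."""
    large = {k: v for k, v in clusters.items() if len(v) >= min_size}
    centroids = {k: centroid(v) for k, v in large.items()}  # fixed before merging
    merged = {k: list(v) for k, v in large.items()}
    for name, vectors in clusters.items():
        if name in large:
            continue
        c = centroid(vectors)
        best, best_sim = None, min_sim
        for large_name, large_centroid in centroids.items():
            sim = cosine(c, large_centroid)
            if sim >= best_sim:
                best, best_sim = large_name, sim
        if best is None:
            merged[name] = list(vectors)  # nothing similar enough: keep it separate
        else:
            merged[best].extend(vectors)
    return merged


# Toy embeddings, for illustration only.
clusters = {
    "oxidative damage": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]],
    "telomere attrition": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]],
    "ROS accumulation": [[0.95, 0.05, 0.0]],
    "unrelated": [[0.0, 0.0, 1.0]],
}
merged = absorb_small_clusters(clusters)
print({name: len(vectors) for name, vectors in merged.items()})
```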
Every theory mention tracked through all stages with full audit trail
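A minimal sketch of what a per-mention audit trail could look like. The class names `TrackedMention` and `StageEvent`, the stage labels, and the field layout are assumptions for illustration; the actual schema of the tracking report is not described here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StageEvent:
    stage: str   # which pipeline stage acted on the mention
    action: str  # what the stage did, e.g. "accepted", "grouped", "mapped"
    value: str   # the mention's name after this stage


@dataclass
class TrackedMention:
    raw_name: str
    history: list = field(default_factory=list)

    def record(self, stage: str, action: str, value: str) -> None:
        """Append one event; earlier events are never overwritten."""
        self.history.append(StageEvent(stage, action, value))

    @property
    def current_name(self) -> str:
        """Name after the latest stage, or the raw name if nothing ran yet."""
        return self.history[-1].value if self.history else self.raw_name


# Hypothetical stage labels and values.
mention = TrackedMention("free-radical theory of ageing")
mention.record("validation", "accepted", mention.raw_name)
mention.record("fuzzy_matching", "grouped", "Free Radical Theory of Aging")
mention.record("canonical_mapping", "mapped", "Free Radical Theory")

report = json.dumps(asdict(mention), indent=2)
print(report)
```

Because every stage appends rather than replaces, any canonical theory can be traced back to the raw extraction it came from.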
When neither the LLM nor the code can match a theory to any known name, the theory may be either novel or invalid. An LLM then determines its validity based on rigorous criteria.
Name vs. Concepts Mismatch: A theory name might seem valid while its associated concepts reveal that it is not an aging theory (or vice versa). Validating both the name and the concepts is crucial for accuracy.
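The name-versus-concepts check can be sketched as two independent verdicts that are then compared. Everything here is an assumption: the function `dual_validate`, the three outcome labels, and the keyword-based judges, which are trivial stand-ins for LLM calls. How a mismatch is ultimately resolved is not specified, so the sketch only flags it.

```python
from typing import Callable, List

Judge = Callable[[str], bool]


def dual_validate(name: str, concepts: List[str],
                  name_judge: Judge, concept_judge: Judge) -> str:
    """Judge the name and the concepts separately, then compare the verdicts."""
    name_ok = name_judge(name)
    concepts_ok = concept_judge("; ".join(concepts))
    if name_ok and concepts_ok:
        return "valid"
    if not name_ok and not concepts_ok:
        return "invalid"
    return "mismatch"  # the two verdicts disagree; needs a closer look


# Stand-in judges (hypothetical): a real pipeline would prompt an LLM instead.
def stub_name_judge(name: str) -> bool:
    return "theory" in name.lower()


def stub_concept_judge(text: str) -> bool:
    return any(k in text.lower() for k in ("aging", "senescence", "lifespan"))


cases = [
    ("Disposable Soma Theory", ["trade-off between reproduction and lifespan"]),
    ("Queueing Theory", ["waiting lines", "service rates"]),
    ("Figure 3 results", ["sample sizes", "error bars"]),
]
verdicts = [dual_validate(n, c, stub_name_judge, stub_concept_judge) for n, c in cases]
print(verdicts)
```

Keeping the two judgments separate is what makes the mismatch visible: a single combined prompt could let a plausible-sounding name mask unrelated concepts.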
Output: Only valid theories proceed → output/theory_tracking_report.json
The culmination of extensive multi-stage normalization work
Extensive multi-stage processing transforms raw extractions into clean, canonical theories
Compression Ratio: From noise to signal through rigorous validation