Aging Theories Pipeline
Stage 4: Theory Extraction & Normalization

Multi-stage LLM orchestration for canonical theory mapping

Rationale & Novelty

Problem

Theory names are inconsistent across papers; many extracted "theories" are spurious, overly specific, or duplicates with different names.

Solution

A multi-stage LLM pipeline extracts, normalizes, validates, and clusters theory mentions, then maps them to a canonical ontology.

Key Novelty

  • Mechanism-based extraction: Key concepts extracted alongside names, enabling mechanism-aware clustering
  • Hybrid normalization: Combines fuzzy matching, LLM mapping, and semantic clustering
  • Paper focus weighting: All theories extracted, but only strong theory-paper links (focus ≥ 6/10) retained
  • Iterative refinement: 8-stage normalization pipeline with shuffle-group/split cycles
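
The paper focus weighting above can be illustrated with a minimal sketch. Only the threshold (focus ≥ 6/10) comes from the text; the `TheoryLink` record, its field names, and the sample data are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

FOCUS_THRESHOLD = 6  # from the text: keep links with focus >= 6 on a 10-point scale


@dataclass(frozen=True)
class TheoryLink:
    paper_id: str      # hypothetical identifier
    theory_name: str
    focus: int         # assumed integer score from 0 to 10


def retain_strong_links(links, threshold=FOCUS_THRESHOLD):
    """Keep only theory-paper links whose focus score meets the threshold."""
    return [link for link in links if link.focus >= threshold]


# Every theory is extracted, but weak links are dropped afterwards.
links = [
    TheoryLink("paper-001", "Free Radical Theory", 9),
    TheoryLink("paper-001", "Disposable Soma Theory", 3),
    TheoryLink("paper-002", "Disposable Soma Theory", 6),
]
strong = retain_strong_links(links)
print([(link.paper_id, link.theory_name) for link in strong])
```

Filtering after extraction, rather than during it, keeps the weak mentions available for auditing even though they do not reach the final theory-paper mapping.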

8-Stage Normalization Pipeline

0

Quality Filter

An LLM checks whether each extracted mention is a genuine aging theory or a spurious extraction

1

Fuzzy Matching

RapidFuzz string matching groups similar theory names (e.g., 'Free Radical Theory' ↔ 'Free Radical Theory of Aging')
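
As a rough illustration of this stage, the sketch below groups names greedily by string similarity. It deliberately swaps in Python's standard-library `difflib.SequenceMatcher` for RapidFuzz so that it runs without extra dependencies; the 0.75 threshold, the normalization step, and the greedy strategy are assumptions, not the pipeline's actual settings.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity in [0, 1]; a stdlib stand-in for a RapidFuzz scorer."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_names(names, threshold=0.75):
    """Greedy grouping: compare each name to the first member of each group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(group[0], name) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no group was close enough
    return groups


names = [
    "Free Radical Theory",
    "Free Radical Theory of Aging",
    "Disposable Soma Theory",
    "disposable soma theory of ageing",
]
groups = group_names(names)
for group in groups:
    print(group)
```

Comparing against only the first member of each group keeps the sketch short; it makes the result depend on input order, which is one reason later stages revisit the groups.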

1.5

LLM Mapping (100% confidence)

The LLM maps each mention to a list of known theories, keeping only matches it reports with 100% confidence, which reduces ambiguity

2

Initial Grouping

Group theories by name similarity and mechanism overlap

3

Refinement

An LLM reviews the groups, splits over-merged clusters, and merges synonyms

4

LLM Validation

Validate each canonical theory against aging theory definition and mechanism coherence

5-7

Mechanism-Based Clustering

Embeddings, clustering, and LLM refinement: small clusters are assigned to larger ones, and overly generic names are split
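
The "assign small clusters to large" step might look like the sketch below. It assumes each theory mention already has an embedding vector and an initial cluster label; the toy 2-D vectors, the `min_size` cutoff, and the use of cosine similarity between centroids are all illustrative assumptions rather than details given by the source.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def absorb_small_clusters(clusters, min_size=3):
    """Merge every cluster smaller than min_size into the most similar large one."""
    large = {k: list(v) for k, v in clusters.items() if len(v) >= min_size}
    if not large:  # nothing to absorb into; return a copy unchanged
        return {k: list(v) for k, v in clusters.items()}
    # Centroids are fixed before merging so the result does not depend on order.
    centroids = {k: centroid(v) for k, v in large.items()}
    for name, vectors in clusters.items():
        if name in large:
            continue
        c = centroid(vectors)
        target = max(centroids, key=lambda k: cosine(c, centroids[k]))
        large[target].extend(vectors)
    return large


# Toy embeddings (hypothetical): two large clusters and one singleton.
clusters = {
    "oxidative": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    "telomere": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0]],
    "ros_misc": [[0.8, 0.3]],
}
merged = absorb_small_clusters(clusters)
print({name: len(vectors) for name, vectors in merged.items()})
```

In the pipeline described here an LLM refines the result afterwards; this sketch covers only the geometric assignment.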

Final

Provenance Tracking

Every theory mention is tracked through all stages with a full audit trail
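
One way to picture such an audit trail is a record that appends an event each time a stage touches a mention. The class names, fields, and sample actions below are hypothetical; the source does not describe the actual tracking schema.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceEvent:
    stage: str    # stage label, e.g. "0" or "1.5"
    action: str   # free-text description of what the stage did
    result: str   # theory name after the stage ran


@dataclass
class TheoryMention:
    paper_id: str
    raw_name: str
    current_name: str = ""
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Start from the raw extracted name if no current name is given.
        if not self.current_name:
            self.current_name = self.raw_name

    def record(self, stage, action, new_name=None):
        """Append an audit event, optionally renaming the mention."""
        if new_name is not None:
            self.current_name = new_name
        self.history.append(ProvenanceEvent(stage, action, self.current_name))


mention = TheoryMention("paper-001", "Free Radical Theory of Aging")
mention.record("0", "passed quality filter")
mention.record("1", "fuzzy-matched to shorter name", "Free Radical Theory")
for event in mention.history:
    print(event.stage, event.action, "->", event.result)
```

Because the raw name is never overwritten, any canonical theory can be traced back to the exact wording extracted from the paper.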

Deep Dive: Advanced Clustering Methodology (Stages 5-7)

Step 4: Novel Theory Validation

When neither the LLM nor the code-based matching can map a theory to a known name, the theory may be either novel or invalid. An LLM then judges its validity against explicit criteria.

Validation Inputs

  • Criteria: Valid aging theory definition (from Stage 3)
  • Theory name: Extracted name from paper
  • Paper title: Context for evaluation
  • Key concepts: Mechanisms and biological processes

Critical Insight

Name vs. Concepts Mismatch: A theory name may look valid while its key concepts reveal that it is not an aging theory, or the reverse. Checking both the name and the concepts is crucial for accuracy.

Output: Only valid theories proceed → output/theory_tracking_report.json
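
The validation step above can be sketched as a prompt built from the four inputs, followed by a filter on the model's verdict. The prompt wording, the VALID/INVALID answer format, the JSON layout, and the `fake_llm` stand-in are assumptions made for illustration; only the four inputs and the output path come from the text.

```python
import json
from pathlib import Path

REPORT_PATH = Path("output/theory_tracking_report.json")  # path named in the text


def build_validation_prompt(criteria, theory_name, paper_title, key_concepts):
    """Combine the four validation inputs into one prompt (wording is hypothetical)."""
    return (
        "Decide whether the candidate is a valid aging theory.\n"
        f"Criteria: {criteria}\n"
        f"Theory name: {theory_name}\n"
        f"Paper title: {paper_title}\n"
        f"Key concepts: {'; '.join(key_concepts)}\n"
        "Judge the name and the concepts together. Answer VALID or INVALID."
    )


def filter_valid(candidates, criteria, ask_llm):
    """Keep candidates for which ask_llm (any callable: prompt -> str) answers VALID."""
    valid = []
    for cand in candidates:
        prompt = build_validation_prompt(
            criteria, cand["theory_name"], cand["paper_title"], cand["key_concepts"]
        )
        if ask_llm(prompt).strip().upper() == "VALID":
            valid.append(cand)
    return valid


def write_report(valid, path=REPORT_PATH):
    """Write surviving theories as JSON (the schema here is an assumption)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"valid_theories": valid}, indent=2), encoding="utf-8")


def fake_llm(prompt):
    """Deterministic stand-in for a real model call, for demonstration only."""
    concepts = next(l for l in prompt.splitlines() if l.startswith("Key concepts:"))
    return "VALID" if "senescence" in concepts.lower() else "INVALID"


criteria = "Proposes a causal mechanism for organismal aging."  # placeholder text
candidates = [
    {"theory_name": "Network Theory", "paper_title": "Hypothetical paper A",
     "key_concepts": ["cellular senescence", "loss of proteostasis"]},
    {"theory_name": "Aging Clock Theory", "paper_title": "Hypothetical paper B",
     "key_concepts": ["image segmentation", "benchmark datasets"]},
]
valid = filter_valid(candidates, criteria, fake_llm)
print([c["theory_name"] for c in valid])
# write_report(valid) would then save the result to REPORT_PATH.
```

The two sample candidates mirror the name vs. concepts mismatch: the generic-sounding name passes because of its concepts, while the aging-sounding name fails.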

Results & Impact

The culmination of extensive multi-stage normalization work

Complete Pipeline Progress

Stage 1: 108K+ papers
Stage 2: 90%+ full texts
Stage 3: 30K+ validated
Stage 4: 28K+ extracted

8-Stage Normalization Pipeline

Extensive multi-stage processing transforms raw extractions into clean, canonical theories

28,000+ initial theories (from 30K+ papers)
8 stages: normalization process (validation & clustering)
2,141 canonical theories (from 15,813 papers)

Final Result

28,000+ raw theories → 2,141 clean theories (12.9:1)

Compression Ratio: From noise to signal through rigorous validation