Stage 4: Theory Extraction & Normalization

Multi-stage LLM orchestration for canonical theory mapping

Rationale & Novelty

Problem

Theory names are inconsistent across papers; many extracted "theories" are spurious, overly specific, or duplicates with different names.

Solution

A multi-stage LLM pipeline extracts, normalizes, validates, and clusters theory mentions, mapping them to a canonical ontology.

Key Novelty

Mechanism-based extraction: Key concepts are extracted alongside names, enabling mechanism-aware clustering
Hybrid normalization: Combines fuzzy matching, LLM mapping, and semantic clustering
Paper focus weighting: All theories are extracted, but only strong theory-paper links (focus ≥ 6/10) are retained
Iterative refinement: 8-stage normalization pipeline with shuffle-group/split cycles
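
The paper focus weighting above can be illustrated with a minimal sketch. Only the threshold (focus ≥ 6/10) comes from the text; the `TheoryLink` record, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass

FOCUS_THRESHOLD = 6  # from the text: links with focus >= 6/10 are retained


@dataclass(frozen=True)
class TheoryLink:
    """Hypothetical record linking one paper to one extracted theory."""
    paper_id: str
    theory_name: str
    focus: int  # assumed 0-10 score of how central the theory is to the paper


def retain_strong_links(links, threshold=FOCUS_THRESHOLD):
    """Keep only theory-paper links whose focus score meets the threshold."""
    return [link for link in links if link.focus >= threshold]


# Illustrative data, not taken from the pipeline.
links = [
    TheoryLink("paper-1", "Free Radical Theory", 9),
    TheoryLink("paper-1", "Disposable Soma Theory", 3),
    TheoryLink("paper-2", "Free Radical Theory of Aging", 6),
]
strong = retain_strong_links(links)
```

Every mention is still extracted; the filter only decides which theory-paper links are carried forward.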

8-Stage Normalization Pipeline

Quality Filter

An LLM checks whether each extracted mention is a genuine aging theory or a spurious extraction

Fuzzy Matching

The RapidFuzz library groups similar theory names (e.g., 'Free Radical Theory' ↔ 'Free Radical Theory of Aging')
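
A rough sketch of name grouping by string similarity follows. To stay dependency-free it swaps in the standard library's `difflib.SequenceMatcher` for RapidFuzz, so the scores will differ from what the pipeline computes. The greedy grouping strategy and the 0.75 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib stand-in for RapidFuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar_names(names, threshold=0.75):
    """Greedily place each name in the first group whose representative
    (the group's first member) is similar enough; otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


groups = group_similar_names([
    "Free Radical Theory",
    "Free Radical Theory of Aging",
    "Telomere Shortening Theory",
])
```

Comparing against one representative per group keeps the sketch simple but makes the result depend on input order, which may be one reason the pipeline follows fuzzy matching with LLM review.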

Stage 1.5: LLM Mapping (100% confidence)

Map each mention to the known-theories list, accepting only mappings the LLM makes with full confidence, which reduces ambiguity
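
One way to picture a confidence-gated mapping step is sketched below. The `ask_llm` callable, its response shape, the sample `KNOWN_THEORIES` list, and the stub `fake_llm` are all hypothetical stand-ins; the actual prompt and model interface are not described in the source.

```python
# Illustrative subset; the real known-theories list is not given in the text.
KNOWN_THEORIES = ["Free Radical Theory of Aging", "Disposable Soma Theory"]


def map_to_known(mention, ask_llm, known=KNOWN_THEORIES, required_confidence=1.0):
    """Return the canonical name only if the model picks a known theory with
    the required confidence; otherwise return None (left for later stages)."""
    answer = ask_llm(mention, known)  # assumed shape: {"canonical": str, "confidence": float}
    canonical = answer.get("canonical")
    if canonical in known and answer.get("confidence", 0.0) >= required_confidence:
        return canonical
    return None


def fake_llm(mention, known):
    """Stub standing in for a real model call, for demonstration only."""
    if "free radical" in mention.lower():
        return {"canonical": "Free Radical Theory of Aging", "confidence": 1.0}
    return {"canonical": "Disposable Soma Theory", "confidence": 0.6}


mapped = map_to_known("free radical hypothesis", fake_llm)
unmapped = map_to_known("energy trade-off idea", fake_llm)
```

Rejecting anything below the gate means uncertain mentions are not forced onto a known name and can still be handled by grouping and validation.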

Initial Grouping

Group theories by name similarity and mechanism overlap
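
As a hedged illustration of combining the two signals, the sketch below groups two theories when either their names are similar or their key concepts overlap. The OR rule, both thresholds, the Jaccard measure, and the sample records are assumptions; the source does not say how the signals are combined.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Share of concepts common to both sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def should_group(t1, t2, name_min=0.75, concept_min=0.5):
    """Assumed rule: group on sufficient name similarity OR concept overlap."""
    name_sim = SequenceMatcher(None, t1["name"].lower(), t2["name"].lower()).ratio()
    overlap = jaccard(t1["concepts"], t2["concepts"])
    return name_sim >= name_min or overlap >= concept_min


# Illustrative records, not pipeline data.
oxidative = {"name": "Oxidative Damage Theory",
             "concepts": {"ROS", "mitochondria", "oxidative damage"}}
free_radical = {"name": "Free Radical Theory",
                "concepts": {"ROS", "oxidative damage", "antioxidants"}}
telomere = {"name": "Telomere Shortening Theory",
            "concepts": {"telomeres", "replicative senescence"}}

same_mechanism = should_group(oxidative, free_radical)   # differently named, shared mechanism
different_mechanism = should_group(free_radical, telomere)
```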

Refinement

An LLM reviews the groups, splits over-merged clusters, and merges synonyms

LLM Validation

Validate each canonical theory against the aging theory definition and check its mechanisms for coherence

Stages 5-7: Mechanism-Based Clustering

Embeddings, clustering, and LLM refinement: small clusters are assigned to large ones, and overly generic names are split
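
The "assign small clusters to large" step might look roughly like the sketch below, which attaches each small cluster to the most similar large cluster by cosine similarity of centroids. The toy 2-D vectors, the size cutoff, and the similarity threshold are assumptions; the actual embedding model, clustering algorithm, and LLM refinement are not specified in the source and are not reproduced here.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_small_clusters(clusters, min_size=3, min_similarity=0.8):
    """Merge each cluster smaller than min_size into its most similar large
    cluster, unless no large cluster is similar enough."""
    large = {name: vecs for name, vecs in clusters.items() if len(vecs) >= min_size}
    merged = {name: list(vecs) for name, vecs in large.items()}
    for name, vecs in clusters.items():
        if name in large:
            continue
        c = centroid(vecs)
        best = max(large, key=lambda n: cosine(c, centroid(large[n])), default=None)
        if best is not None and cosine(c, centroid(large[best])) >= min_similarity:
            merged[best].extend(vecs)
        else:
            merged[name] = list(vecs)  # kept separate for later review
    return merged


# Toy embeddings standing in for key-concept embeddings.
clusters = {
    "oxidative": [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]],
    "telomere": [[0.0, 1.0], [0.1, 0.9], [0.05, 0.95]],
    "ros-variant": [[0.9, 0.2]],
    "unrelated": [[-1.0, -1.0]],
}
merged = assign_small_clusters(clusters)
```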

Final: Provenance Tracking

Every theory mention is tracked through all stages with a full audit trail
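
A per-mention audit trail could be kept with a structure along these lines. The class, field names, and stage labels in the example are hypothetical; the pipeline's actual tracking schema is not shown in the source.

```python
from dataclasses import dataclass, field


@dataclass
class TheoryMention:
    """Hypothetical record carrying one mention and its history through the stages."""
    paper_id: str
    raw_name: str
    current_name: str = ""
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Before any stage runs, the current name is the raw extracted name.
        self.current_name = self.current_name or self.raw_name

    def record(self, stage, new_name, reason):
        """Append one audit entry and update the current name."""
        self.history.append({
            "stage": stage,
            "from": self.current_name,
            "to": new_name,
            "reason": reason,
        })
        self.current_name = new_name


mention = TheoryMention("paper-1", "free radical theory")
mention.record("fuzzy_matching", "Free Radical Theory", "grouped with near-identical name")
mention.record("llm_mapping", "Free Radical Theory of Aging", "mapped to known theory")
```

Because each entry stores both the old and the new name, the final canonical theory can be traced back to the wording used in the paper.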

Deep Dive: Advanced Clustering Methodology (Stages 5-7)

Step 4: Novel Theory Validation

When neither the LLM nor the matching code can map a theory to a known name, the theory may be either novel or invalid. An LLM then decides validity against explicit criteria, using the inputs below.

Validation Inputs

Criteria: Valid aging theory definition (from Stage 3)
Theory name: Extracted name from paper
Paper title: Context for evaluation
Key concepts: Mechanisms and biological processes

Critical Insight

Name vs. Concepts Mismatch: A theory name might seem valid while its key concepts reveal that it is not an aging theory (or vice versa). Checking both the name and the concepts is crucial for accuracy.

Output: Only valid theories proceed → output/theory_tracking_report.json
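
The validation step can be sketched as assembling the four inputs into a prompt and keeping only the candidates judged valid. The prompt wording, the `judge` callable, the stub `fake_judge`, and the report layout are assumptions; only the four inputs and the output path come from the text, and the sketch serializes to a string rather than writing the file.

```python
import json


def build_validation_prompt(criteria, theory_name, paper_title, key_concepts):
    """Combine the four validation inputs into one prompt (wording is illustrative)."""
    return (
        f"Criteria for a valid aging theory:\n{criteria}\n\n"
        f"Theory name: {theory_name}\n"
        f"Paper title: {paper_title}\n"
        f"Key concepts: {', '.join(key_concepts)}\n\n"
        "Judge the name and the key concepts together. Answer VALID or INVALID."
    )


def keep_valid(candidates, criteria, judge):
    """Return only candidates for which the judge answers VALID."""
    valid = []
    for cand in candidates:
        prompt = build_validation_prompt(
            criteria, cand["theory_name"], cand["paper_title"], cand["key_concepts"]
        )
        if judge(prompt).strip().upper() == "VALID":
            valid.append(cand)
    return valid


def fake_judge(prompt):
    """Stub in place of a real LLM call: rejects prompts mentioning marketing."""
    return "INVALID" if "marketing" in prompt.lower() else "VALID"


candidates = [
    {"theory_name": "Mitochondrial Network Decline Theory",
     "paper_title": "Mitochondrial dynamics in aging muscle",
     "key_concepts": ["mitochondrial fission", "ROS", "sarcopenia"]},
    {"theory_name": "Aging Population Theory",
     "paper_title": "Consumer trends among older adults",
     "key_concepts": ["marketing segments", "purchasing behavior"]},
]
valid = keep_valid(candidates, "Explains why or how organisms age.", fake_judge)
report = json.dumps({"valid_theories": valid}, indent=2)  # would be written to the report file
```

The second candidate shows the mismatch case: its name sounds plausible, but its key concepts are unrelated to biological aging.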

Results & Impact

The culmination of extensive multi-stage normalization work

Complete Pipeline Progress

Stage 1: 108K+ papers
Stage 2: 90%+ full texts
Stage 3: 30K+ validated
Stage 4: 28K+ extracted

8-Stage Normalization Pipeline

Extensive multi-stage processing transforms raw extractions into clean, canonical theories

Initial Theories: 28,000+ (from 30K+ papers)
Normalization Process: 8 stages of validation and clustering
Canonical Theories: 2,141 (from 15,813 papers)

Final Result

Raw Theories: 28,000+
Clean Theories: 2,141
Compression Ratio: 12.9:1 (from noise to signal through rigorous validation)
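
As a quick arithmetic check on the compression ratio, the snippet below relates the rounded figures to each other. It uses only the numbers quoted above; the exact raw count is not given, so 28,000 is treated as a rounded value.

```python
raw_rounded = 28_000   # quoted as "28,000+", so only approximate
canonical = 2_141      # canonical theories after normalization
quoted_ratio = 12.9    # compression ratio as quoted

# Ratio implied by the rounded raw count.
ratio_from_rounded = raw_rounded / canonical

# Raw count that would reproduce the quoted ratio exactly.
implied_raw = quoted_ratio * canonical

print(f"{ratio_from_rounded:.2f}:1 from the rounded count")
print(f"about {implied_raw:,.0f} raw theories implied by {quoted_ratio}:1")
```

The rounded count gives roughly 13.1:1, while a 12.9:1 ratio would correspond to a raw count of about 27,600, so the two quoted figures agree only to within rounding.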
