AI Aging Theories Pipeline
Stage 2 of 5

Full-Text & Metadata Extraction

Multi-source retrieval with intelligent parsing and quality metrics

Collector Agent
90%+ Recovery Rate

Rationale & Novelty

Problem:

Collecting full texts from 108K papers is challenging due to paywalls, inconsistent APIs, varied PDF formats, and parsing errors. Single-source or single-parser approaches result in low recovery rates.

Solution:

Multi-source retrieval strategy (Sci-Hub, Unpaywall, PubMed, CrossRef) combined with dual PDF parsing (GROBID + PyMuPDF) and intelligent fallback logic to maximize full-text recovery.

Key Novelty:

  • Intelligent parser selection: Quality metrics (word count, structure detection) determine best parser per paper
  • Parallel GPU-accelerated processing: Up to 8x speedup using concurrent workers and GPU optimization
  • Comprehensive tracking: Real-time status monitoring, error recovery, and detailed logging for reproducibility

Key Features

Multi-Source Retrieval

Cascading fallback through Sci-Hub, Unpaywall, PubMed Central, and CrossRef APIs to maximize full-text availability.

Result: 90%+ recovery rate, vs. 60-70% for single-source retrieval
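The cascade can be sketched as a simple priority loop. The per-source fetchers passed in (`fetch_scihub`, `fetch_unpaywall`, etc.) are hypothetical placeholders for the real API clients; this is a minimal sketch of the fallback logic, not the pipeline's exact implementation:

```python
def retrieve_fulltext(doi, sources):
    """Cascade through (name, fetch) pairs in priority order.

    Each fetch callable is a hypothetical per-source client that returns
    PDF bytes on success or None on a miss; the first hit wins.
    """
    for name, fetch in sources:
        try:
            pdf = fetch(doi)
        except Exception:
            pdf = None  # treat network/API errors as a miss and fall through
        if pdf:
            return name, pdf  # record provenance alongside the bytes
    return None  # exhausted all sources


# Usage (with real clients in place of the placeholders):
# result = retrieve_fulltext(doi, [("scihub", fetch_scihub),
#                                  ("unpaywall", fetch_unpaywall),
#                                  ("pmc", fetch_pmc),
#                                  ("crossref", fetch_crossref)])
```

Returning the source name with the bytes is what lets the storage step record provenance per paper.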

Dual PDF Parsing

GROBID for structured extraction (sections, citations) and PyMuPDF for raw text, with automatic quality-based selection.

Metrics: Word count, section detection, formatting preservation
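A minimal sketch of how such quality metrics might be computed. The >1000-word bar comes from the pipeline description; the section-heading regex and the >=3-sections threshold are illustrative assumptions, not the pipeline's exact rules:

```python
import re

# Assumed heuristic: canonical section headings at the start of a line.
SECTION_PAT = re.compile(
    r"^(abstract|introduction|methods?|results|discussion|references)\b",
    re.IGNORECASE | re.MULTILINE,
)

def quality_score(text):
    """Score one parser's output: word count plus crude section detection."""
    words = len(text.split())
    sections = len({m.group(1).lower() for m in SECTION_PAT.finditer(text)})
    return {
        "words": words,
        "sections": sections,
        # word threshold is from the pipeline spec; section threshold is assumed
        "ok": words > 1000 and sections >= 3,
    }
```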

GPU Acceleration

Parallel processing with GPU-optimized PDF parsing achieves up to an 8x speedup over sequential processing.

Throughput: ~1,000 papers/hour on single GPU
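The GPU-specific optimizations are not shown here, but the concurrent-worker pattern behind the speedup can be sketched with a standard thread pool. `process_one` is a hypothetical per-paper function (download + parse); error isolation per paper is what keeps one bad PDF from stalling the batch:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(paper_ids, process_one, max_workers=8):
    """Run process_one over paper_ids concurrently; isolate per-paper errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, pid): pid for pid in paper_ids}
        for fut in as_completed(futures):
            pid = futures[fut]
            try:
                results[pid] = fut.result()
            except Exception as exc:
                errors[pid] = str(exc)  # logged for recovery, not fatal
    return results, errors
```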

Production-Ready Infrastructure

Docker containerization, YAML configuration, REST API, and comprehensive error handling for reliability.

Features: Checkpointing, resume capability, status monitoring

Technical Implementation

Retrieval Pipeline

1. Source Priority: Try Sci-Hub first (fastest), then Unpaywall (open access), then PubMed Central, and finally CrossRef
2. PDF Download: Parallel downloads with retry logic, timeout handling, and checksum verification
3. Dual Parsing: Run both GROBID and PyMuPDF parsers concurrently and compare quality metrics
4. Quality Selection: Choose parser output based on word count (>1000), section detection, and reference extraction
5. Storage: Store full text, metadata, provenance (source, parser used), and quality scores in SQLite
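The storage step (5) might look like the following. The table schema is an assumption for illustration; the key point is that provenance (retrieval source, winning parser) and quality scores are stored alongside the text itself:

```python
import sqlite3

# Assumed schema; the real pipeline's columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS fulltext (
    paper_id TEXT PRIMARY KEY,
    source   TEXT,     -- which retrieval source supplied the PDF
    parser   TEXT,     -- 'grobid' or 'pymupdf', whichever won quality selection
    words    INTEGER,  -- quality score: word count of the parsed text
    text     TEXT
)
"""

def init_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def store(conn, paper_id, source, parser, text):
    """Upsert one paper's text together with its provenance."""
    conn.execute(
        "INSERT OR REPLACE INTO fulltext VALUES (?, ?, ?, ?, ?)",
        (paper_id, source, parser, len(text.split()), text),
    )
    conn.commit()
```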

Parser Comparison

GROBID
  • Structured extraction (sections, citations)
  • Better for well-formatted PDFs
  • Slower but more accurate
PyMuPDF
  • Raw text extraction
  • Better for complex layouts
  • Faster, more robust
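Quality-based selection between the two parser outputs can be sketched as below. Comparing by word count, and breaking ties in GROBID's favor because its output is structured, are assumed heuristics; the real pipeline may weigh section and reference detection as well:

```python
def pick_best(grobid_text, pymupdf_text):
    """Return (parser_name, text) for the better of the two parses.

    Either input may be None if that parser failed on this PDF.
    """
    candidates = [("grobid", grobid_text), ("pymupdf", pymupdf_text)]
    candidates = [(name, text) for name, text in candidates if text]
    if not candidates:
        raise ValueError("both parsers failed")
    # max() keeps the first maximal element, so GROBID wins ties
    # (assumed preference: structured output over raw text).
    return max(candidates, key=lambda nt: len(nt[1].split()))
```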

Results & Impact

90%+ Full-Text Recovery: High-fidelity corpus for LLM extraction
8x Processing Speedup: GPU-accelerated parallel processing
2 Parser Types: Intelligent quality-based selection