Collecting full texts for 108K papers is difficult because of paywalls, inconsistent APIs, varied PDF formats, and parsing errors. Relying on a single source or a single parser yields low recovery rates.
A multi-source retrieval strategy (Sci-Hub, Unpaywall, PubMed Central, CrossRef) is combined with dual PDF parsing (GROBID + PyMuPDF) and fallback logic to maximize full-text recovery.
Cascading fallback through Sci-Hub, Unpaywall, PubMed Central, and CrossRef: sources are tried in turn until one returns the full text.
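A minimal sketch of how such a cascade could be structured, assuming each source is wrapped in a function that takes a DOI and returns PDF bytes or None. The names fetch_with_fallback and Fetcher are hypothetical, and the real source clients are not shown; the order of the list passed in determines the fallback order.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical fetcher type: takes a DOI, returns PDF bytes or None if unavailable.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    doi: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each (name, fetcher) pair in order and return the first hit.

    A source that raises or returns nothing is skipped, so one failing
    service does not stop the cascade.
    """
    for name, fetch in sources:
        try:
            content = fetch(doi)
        except Exception:
            # Treat any error (timeout, HTTP error, bad response) as a miss.
            continue
        if content:
            return name, content
    return None, None
```

In use, the caller would pass something like [("unpaywall", fetch_unpaywall), ("pmc", fetch_pmc)], where each fetcher is a thin wrapper around the corresponding service. Returning the source name alongside the content makes it possible to report per-source recovery rates later.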
GROBID for structured extraction (sections, citations) and PyMuPDF for raw text, with automatic quality-based selection.
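The quality-based selection could look roughly like the sketch below. The scoring heuristic (share of ordinary characters), the thresholds, and the preference for GROBID output when both parses are usable are assumptions made for illustration, not the project's documented criteria. The inputs are plain strings already produced by each parser.

```python
import string
from typing import Optional, Tuple

_PUNCT = set(string.punctuation)


def text_quality(text: str) -> float:
    """Fraction of characters that are letters, digits, whitespace, or ASCII punctuation."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() or ch in _PUNCT for ch in stripped)
    return ok / len(stripped)


def select_parse(
    grobid_text: Optional[str],
    pymupdf_text: Optional[str],
    min_chars: int = 200,      # assumed threshold
    min_quality: float = 0.8,  # assumed threshold
) -> Tuple[Optional[str], Optional[str]]:
    """Return (parser_name, text) for the first parse that passes the checks."""

    def usable(text: Optional[str]) -> bool:
        return (
            text is not None
            and len(text.strip()) >= min_chars
            and text_quality(text) >= min_quality
        )

    # Prefer the structured GROBID output; fall back to raw PyMuPDF text.
    if usable(grobid_text):
        return "grobid", grobid_text
    if usable(pymupdf_text):
        return "pymupdf", pymupdf_text
    return None, None
```

A paper for which neither parse passes returns (None, None), which lets the caller log it for manual review instead of storing garbled text.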
Parallel processing with GPU-optimized PDF parsing achieves an 8x speedup over sequential processing.
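As an illustration of the parallel part only, the sketch below fans a per-paper function out over a thread pool using Python's standard library. The names process_all and process_one are hypothetical, the GPU-side parsing is not represented, and the sketch makes no attempt to reproduce the reported speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple

Result = Tuple[str, Optional[str], Optional[str]]  # (paper_id, text, error)


def process_all(
    paper_ids: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 8,  # assumed default; tune to the workload
) -> List[Result]:
    """Run process_one over all papers concurrently, preserving input order."""

    def safe(paper_id: str) -> Result:
        # Capture failures per paper so one bad PDF does not abort the batch.
        try:
            return paper_id, process_one(paper_id), None
        except Exception as exc:
            return paper_id, None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, paper_ids))
```

Threads suit this kind of workload when most of the time is spent waiting on downloads or on an external parsing service; CPU-bound parsing in pure Python would more likely call for a process pool.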
Docker containerization, YAML configuration, REST API, and comprehensive error handling for reliability.