Abstract
Purpose
This study presents and evaluates a system built on large language models (LLMs) for structured information extraction from breast cancer histopathology reports.
- Demonstrates human-level accuracy using zero-shot prompting without prior model fine-tuning
- Successfully extracts 51 complex clinical features from unstructured medical text
- Introduces the open-source Medical Report Information Extractor tool to expand accessibility for non-programmers
Scope
The research addresses a critical challenge in clinical informatics: transforming unstructured medical data into usable, structured formats at scale.
- Scalable Data Extraction: Enables high-volume processing of pathology reports without manual annotation
- Clinical Accuracy: Matches expert-level performance in identifying detailed clinical features
- Accessibility & Usability: Provides an open-source solution tailored to users without programming expertise
- Privacy & Cost Efficiency: Supports self-hosting to protect patient data and reduce infrastructure costs
- Open Tooling: Facilitates widespread adoption through transparent, shareable resources
For healthcare systems and medical researchers, the work offers a practical path toward integrating LLMs into clinical workflows at scale.
In clinical and research environments, pathology reports are typically stored as unstructured text, which makes it difficult to extract usable data. Manual extraction is costly, slow, and often error-prone. This study presents a scalable alternative: using large language models to automatically extract the structured fields defined by a study-specific data dictionary.
The team developed a methodology using zero-shot prompting, a technique in which the model receives only task instructions and no labeled examples. This keeps deployment accessible even when labeled data is unavailable. The prompt wording was refined on a small training set of reports (the model itself was not fine-tuned) and was designed to generalize to new, unseen pathology reports with high performance.
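As a rough illustration of zero-shot prompting driven by a data dictionary, the sketch below assembles a prompt from field definitions and a report. The field names, allowed values, prompt wording, and the `build_zero_shot_prompt` helper are all hypothetical and are not taken from the study, whose dictionary covers 51 features.

```python
# Hypothetical two-entry data dictionary (illustrative only, not the study's).
DATA_DICTIONARY = [
    {
        "name": "tumor_laterality",
        "description": "Side of the breast involved",
        "allowed": ["left", "right", "bilateral", "not reported"],
    },
    {
        "name": "er_status",
        "description": "Estrogen receptor status",
        "allowed": ["positive", "negative", "not reported"],
    },
]


def build_zero_shot_prompt(report_text, dictionary):
    """Compose task instructions plus the report text.

    No worked examples are included, which is what makes the prompt zero-shot.
    """
    lines = [
        "Extract the following fields from the pathology report.",
        "Return a single JSON object with exactly these keys.",
        "",
    ]
    for field in dictionary:
        allowed = ", ".join(field["allowed"])
        lines.append(
            f"- {field['name']}: {field['description']}. Allowed values: {allowed}."
        )
    lines += ["", "Report:", report_text]
    return "\n".join(lines)


prompt = build_zero_shot_prompt(
    "Left breast, core biopsy: invasive ductal carcinoma. ER positive.",
    DATA_DICTIONARY,
)
print(prompt)
```

The resulting string would then be sent to whichever LLM is in use; that call is omitted here because it depends on the provider.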
The researchers created a modular web application, the “Medical Report Information Extractor”, which connects to various LLMs via APIs. The app produces structured output as JSON and can optionally convert it into a standardized Linked Data format (JSON-LD) with terms mapped to SNOMED CT. Users adjust the tool's behavior through three simple, human-readable configuration files, which makes it usable by non-programmers.
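The JSON to JSON-LD step can be pictured as attaching an `@context` that maps each extracted key to a concept IRI. The sketch below is a minimal illustration under stated assumptions: the keys and the `to_json_ld` helper are hypothetical, and the numeric concept identifiers are placeholders rather than real SNOMED CT codes. Only the `http://snomed.info/id/` prefix follows the SNOMED CT URI convention.

```python
import json

SNOMED_BASE = "http://snomed.info/id/"

# Placeholder identifiers: NOT real SNOMED CT concept codes.
TERM_TO_IRI = {
    "er_status": SNOMED_BASE + "000000001",
    "tumor_laterality": SNOMED_BASE + "000000002",
}


def to_json_ld(extracted, term_to_iri):
    """Wrap a flat extraction result with a JSON-LD @context.

    Keys without a mapping are kept as plain values; a JSON-LD processor
    ignores terms it cannot expand to an IRI.
    """
    context = {key: term_to_iri[key] for key in extracted if key in term_to_iri}
    return {"@context": context, **extracted}


extracted = {"er_status": "positive", "tumor_laterality": "left", "note": "free text"}
document = to_json_ld(extracted, TERM_TO_IRI)
print(json.dumps(document, indent=2))
```

Keeping the mapping in a separate table mirrors the idea of editing behavior through configuration rather than code, although the tool's actual configuration format is not described here.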
Five state-of-the-art LLMs were evaluated, including OpenAI's proprietary GPT-4o and Meta's open-source, self-hostable Llama 3.1 models (405B, 70B, and 8B). GPT-4o reached 96.1% accuracy, closely followed by Llama 3.1 405B at 94.7%; both were comparable to the human annotator. These findings support the viability of LLMs as cost-effective and scalable alternatives to manual annotation.
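The summary does not spell out how accuracy was computed. One plausible reading, assumed here, is exact-match accuracy over every (report, feature) pair against the gold standard. The sketch below uses invented toy data and a hypothetical `accuracy` helper.

```python
def accuracy(predictions, gold):
    """Fraction of (report, feature) pairs where the prediction equals the gold value.

    A feature missing from the predictions counts as incorrect.
    """
    total = 0
    correct = 0
    for report_id, gold_fields in gold.items():
        predicted_fields = predictions.get(report_id, {})
        for feature, gold_value in gold_fields.items():
            total += 1
            if predicted_fields.get(feature) == gold_value:
                correct += 1
    return correct / total if total else 0.0


# Toy data (invented): two reports, two features each.
gold = {
    "r1": {"er_status": "positive", "tumor_laterality": "left"},
    "r2": {"er_status": "negative", "tumor_laterality": "right"},
}
predictions = {
    "r1": {"er_status": "positive", "tumor_laterality": "left"},
    "r2": {"er_status": "negative", "tumor_laterality": "left"},
}

score = accuracy(predictions, gold)
print(score)  # 3 of 4 pairs match -> 0.75
```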
GPT-4o delivered the highest accuracy but also incurred the highest processing cost. Llama 3.1 70B offered a strong balance between performance and cost, making it attractive for self-hosted deployments. The smallest, most portable model (Llama 3.1 8B) lagged behind the larger models, though it remains promising for future on-device applications with privacy benefits.
Recognizing the sensitivity of medical data, the study emphasizes self-hosting as a way to preserve privacy. In addition, the extracted outputs were mapped to SNOMED CT terms using Linked Data (JSON-LD format), which enables interoperability for downstream research and is a significant step toward the FAIR (Findable, Accessible, Interoperable, Reusable) data principles in clinical informatics.
To assess performance fairly, a new gold-standard dataset was built by reconciling disagreements between GPT-4o and human annotator outputs: an independent physician resolved each disagreement, and the cases were reviewed for OCR errors. This methodology was intended to make the evaluation metrics reliable.
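The first step of such a reconciliation is listing where the two sources disagree so that an adjudicator can review only those cells. The sketch below is one way to do that, with invented data and a hypothetical `find_disagreements` helper; it is not the study's procedure.

```python
def find_disagreements(model_out, human_out):
    """List (report_id, feature, model_value, human_value) for every mismatch.

    Only reports present in both sources are compared; a feature missing
    from one side is reported with None for that side.
    """
    conflicts = []
    for report_id in sorted(set(model_out) & set(human_out)):
        model_fields = model_out[report_id]
        human_fields = human_out[report_id]
        for feature in sorted(set(model_fields) | set(human_fields)):
            model_value = model_fields.get(feature)
            human_value = human_fields.get(feature)
            if model_value != human_value:
                conflicts.append((report_id, feature, model_value, human_value))
    return conflicts


# Toy data (invented).
model_out = {"r1": {"er_status": "positive", "grade": "2"}}
human_out = {"r1": {"er_status": "positive", "grade": "3"}}

to_review = find_disagreements(model_out, human_out)
print(to_review)
```

Each tuple in `to_review` is a cell for the adjudicator to settle; cells where the two sources agree need no review under this scheme.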
The study identifies ambiguities inherent in pathology report structures and in domain-specific data dictionaries, which highlights the need for better documentation standards and consistent terminology. It also notes the risk introduced by OCR errors and suggests extending evaluations to multilingual and multi-specimen reports.
The paper emphasizes that foundation models, including LLMs, are well positioned to enable rapid development of intelligent, generalizable applications across domains. Doing so responsibly requires version control, validation benchmarks, and a way to adapt as the underlying models change over time. The design of the Medical Report Information Extractor is consistent with these practices.
The software is open source and can be adapted to other types of clinical or non-clinical text. Future work includes integrating multimodal models to eliminate the OCR step, automating prompt engineering, and fine-tuning smaller models for specific tasks, all of which could make the approach even more accessible.
This study is a notable step toward transforming biomedical data extraction workflows using AI. By combining foundation models, a configurable user interface, and international data standards, the authors offer a scalable, privacy-conscious, and practical solution for the healthcare AI community. Institutions seeking to modernize their clinical research and health informatics infrastructure will find the work especially relevant.
Read more in “Leveraging large language models for structured information extraction from pathology reports” by Jeya Balaji Balasubramanian and other researchers.