Unlocking the Promise and Pitfalls of Large Language Models in Biomedical NLP

- State-of-the-art fine-tuned models outperform current LLMs in extraction and classification within BioNLP.
- GPT-4 excels in medical question answering, achieving nearly 30% higher accuracy than prior best methods.
- LLMs are prone to inconsistencies, missing information, and hallucinations—especially in zero-shot mode.
- Advanced prompt engineering and few-shot learning measurably reduce errors in LLM outputs.
- Cost and context length limits remain significant barriers to LLM deployment in routine biomedical applications.
- Chen, Q., et al. (2025). Benchmarking large language models for biomedical natural language processing applications and recommendations. https://doi.org/10.1038/s41467-025-56989-2
- Supplementary datasets and code repository: https://doi.org/10.5281/zenodo.14025500
- BioBERT: Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. https://academic.oup.com/bioinformatics/article/36/4/1234/5566506
- PubMedBERT: Gu, Y., et al. (2021). Domain-specific language model pretraining for biomedical natural language processing. https://dl.acm.org/doi/10.1145/3450439
- PMC-LLaMA: Wu, C., et al. (2024). PMC-LLaMA: toward building open-source language models for medicine. https://jamanetwork.com/journals/jamia/article-abstract/2817226
The explosion of biomedical literature poses daunting challenges for researchers, clinicians, and knowledge managers. With over 36 million articles on PubMed and thousands more released each day, extracting actionable knowledge has become a monumental task. Enter biomedical Natural Language Processing (BioNLP), a field supercharged by the rise of large language models (LLMs) like GPT-3.5, GPT-4, and open-source alternatives such as LLaMA 2. But do these cutting-edge models truly outperform specialized, fine-tuned tools in extracting, summarizing, and interpreting complex bioscience texts?
A comprehensive study by Chen et al. systematically benchmarks leading LLMs against state-of-the-art (SOTA) fine-tuning methods across 12 major biomedical NLP tasks. The results reveal both the immense potential and practical limitations of deploying generalist LLMs in specialized biomedical domains.
The Race: Generalist LLMs vs. Domain-Specific Fine-Tuning
The study pits four prominent LLMs (GPT-3.5, GPT-4, LLaMA 2, and PMC-LLaMA) against fine-tuned domain-specific models such as BioBERT and BioBART in tasks spanning named entity recognition, relation extraction, multi-label classification, question answering, summarization, and text simplification.
Key findings show that:
- Fine-tuned BERT/BART models consistently outperform LLMs in most extractive tasks, such as identifying gene names or chemical-disease relations, with up to 40% higher accuracy in information extraction (see the sketch after this list).
- LLMs, particularly GPT-4, shine in reasoning-heavy applications, notably medical question answering. Here, GPT-4 surpasses the SOTA fine-tuned methods by nearly 30% in accuracy, a notable leap for zero- and few-shot learning.
- For generative tasks such as summarization and simplification, LLMs deliver competitive accuracy and notably higher readability, though they may lag in completeness compared to fine-tuned baselines.
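To make the extractive setting concrete, here is a minimal sketch of applying a fine-tuned BERT-style model to biomedical named entity recognition with the Hugging Face transformers pipeline. The checkpoint name is a hypothetical placeholder, not one of the exact models benchmarked by Chen et al.

```python
# Minimal sketch: biomedical NER with a fine-tuned BERT-style model.
# The model id below is a hypothetical placeholder for a BioBERT-family
# checkpoint fine-tuned on an NER corpus; swap in a real checkpoint.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="example-org/biobert-bc5cdr-ner",  # hypothetical fine-tuned checkpoint
    aggregation_strategy="simple",           # merge sub-word pieces into whole entities
)

text = "Mutations in BRCA1 are associated with an increased risk of breast cancer."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```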
Hallucinations and Inconsistencies: The Reliability Dilemma
Despite their capabilities, LLMs—especially in zero-shot settings—frequently produce inconsistent, incomplete, or hallucinated outputs. LLaMA 2, for example, exhibited hallucinations in up to 32% of test cases for complex classification tasks. However, introducing even a single carefully selected example (few-shot learning) can dramatically curb these errors.
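As a rough illustration of what adding a single in-context example looks like in practice, the sketch below contrasts a zero-shot and a one-shot request using the OpenAI Python client. The prompt wording, example, and model name are assumptions for illustration, not the prompts used in the study.

```python
# Minimal sketch: zero-shot vs. one-shot prompting for a document-level
# classification task. Prompts, example text, and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = "Does the abstract describe a chemical-induced disease? Answer 'yes' or 'no'."
abstract = "..."  # the abstract to classify

zero_shot = [
    {"role": "system", "content": task},
    {"role": "user", "content": abstract},
]

one_shot = [
    {"role": "system", "content": task},
    # A single carefully chosen worked example constrains the output format
    # and can sharply reduce malformed or hallucinated answers.
    {"role": "user", "content": "Cisplatin treatment was followed by acute renal failure."},
    {"role": "assistant", "content": "yes"},
    {"role": "user", "content": abstract},
]

for name, messages in [("zero-shot", zero_shot), ("one-shot", one_shot)]:
    reply = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
    print(name, reply.choices[0].message.content)
```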
Manual review remains essential, given that automatic metrics (such as ROUGE-L for summaries) do not always align with human evaluators, who consistently preferred the readability of GPT-generated summaries.
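For reference, ROUGE-L can be computed with the rouge-score package as sketched below; the texts are placeholders, and the point is that a higher overlap score does not guarantee the summary a human reviewer would prefer.

```python
# Minimal sketch: scoring a generated summary against a reference with ROUGE-L.
# Requires the rouge-score package (pip install rouge-score); texts are placeholders.
from rouge_score import rouge_scorer

reference = "The trial found the drug reduced systolic blood pressure by 10 mmHg."
generated = "The drug lowered blood pressure in the trial."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)  # longest-common-subsequence overlap, not readability
```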
Practical Considerations: Cost, Context, and Open Science
Cost is a critical factor in real-world application. GPT-4, while leading in performance for some tasks, can be 60 to 100 times more expensive than its predecessors—raising scalability concerns for large-scale or routine use.
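A quick back-of-envelope calculation shows why this matters at corpus scale. The per-token prices and token counts below are hypothetical placeholders chosen only to illustrate the order of magnitude, not the study's or any vendor's actual pricing.

```python
# Back-of-envelope cost comparison for annotating a corpus with two models.
# All prices and token counts are hypothetical placeholders.
abstracts = 100_000
tokens_per_abstract = 1_200  # assumed prompt + completion tokens
price_per_1k_tokens = {"gpt-3.5": 0.002, "gpt-4": 0.12}  # hypothetical USD per 1K tokens

for model, price in price_per_1k_tokens.items():
    cost = abstracts * tokens_per_abstract / 1_000 * price
    print(f"{model}: ${cost:,.0f}")
# With these placeholder prices, GPT-4 costs 60x as much as GPT-3.5 for the same corpus.
```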
Interestingly, the study found that continual biomedical pretraining (as in PMC-LLaMA) offers only marginal gains over standard open-source models after fine-tuning. This highlights a need for more efficient, sustainable development of domain-specific LLMs.
Key Recommendations
Based on extensive benchmarking and error analysis, the authors suggest:
- Use fine-tuned BERT/BART models as the go-to for extractive tasks.
- Deploy closed-source LLMs, starting with GPT-3.5, for reasoning and summarization, keeping costs in mind.
- Apply advanced prompt engineering and manual review to minimize hallucinations and maximize reliability.
- Fine-tune open-source LLMs when input lengths and resources allow for task-specific performance boosts (see the sketch after this list).
- Revisit evaluation paradigms, as existing benchmarks may not capture the nuanced performance of LLMs, especially in generative tasks.
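For the fine-tuning recommendation, a parameter-efficient approach such as LoRA keeps compute and memory needs manageable. The sketch below uses the Hugging Face peft library; the checkpoint and hyperparameters are illustrative assumptions, not the configuration used in the study.

```python
# Minimal sketch: preparing an open-source LLM for task-specific fine-tuning
# with LoRA adapters via peft. Checkpoint and hyperparameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; gated, requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are updated

# A standard training loop or transformers Trainer over task-formatted
# biomedical examples would follow here.
```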
The Road Ahead
While LLMs have closed part of the gap with traditional fine-tuned models in some BioNLP tasks—especially those demanding reasoning—the field still contends with reliability, reproducibility, and cost challenges. Tailored evaluation methodologies and new benchmarks are needed to fully harness the transformative power of LLMs in biomedicine.