MARCUS: Revolutionizing Chemical Literature Curation with AI-Driven Annotation and Recognition

- MARCUS seamlessly integrates text annotation with multi-engine optical chemical structure recognition for automated curation.
- It uses a fine-tuned GPT-4 model and an ensemble of DECIMER, MolNexTR, and MolScribe to optimize chemical entity and structure extraction accuracy.
- The web-based platform supports batch processing and direct submission to the open-access COCONUT natural products database.
- MARCUS significantly reduces manual curation time by automating the conversion of unstructured literature into machine-readable chemical data.
- Open-source code and comprehensive documentation foster community collaboration and future tool enhancements.
- Though automated, expert human validation remains essential for ensuring data quality and accuracy.

- Rajan K., Weißenborn V., Lederer L., Zielesny A., Steinbeck C. MARCUS: Molecular Annotation and Recognition for Curating Unravelled Structures, ChemRxiv, 2025: https://chemrxiv.org/engage/chemrxiv/article-details/686b86cb1a8f9bdab5017104
- Rajan K., et al., A review of optical chemical structure recognition tools, Google Scholar, 2025: https://scholar.google.com/citations?user=-1Kqb3IAAAAJ&hl=en
- Software Tools and Resources 2025: Honoring Margaret Dayhoff, J. Proteome Res., 24(3), 2025: https://pubs.acs.org/page/jprobs/vsi/software-tools-2025
- Understanding Marcus Theory in Coordination Chemistry, Number Analytics Blog, 2025: https://www.numberanalytics.com/blog/ultimate-guide-marcus-theory-coordination-chemistry

As the volume of chemical research publications grows, extracting and curating molecular information from unstructured scientific literature has become a pressing problem. MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures) addresses it by combining AI-based text annotation with automated structure recognition. Built as a web-based platform, MARCUS covers the workflow from unstructured PDF documents to entries in open-access chemical databases, with a focus on natural product chemistry. It uses a fine-tuned GPT-4 model to extract chemical entities from text and an ensemble of Optical Chemical Structure Recognition (OCSR) engines (DECIMER, MolNexTR, and MolScribe) to convert depicted chemical structures into machine-readable formats. Combining these steps in one tool is an improvement over existing isolated solutions, which often suffer from limited scalability.
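The article does not say how MARCUS reconciles the outputs of its three OCSR engines. As an illustration only, the sketch below assumes a simple majority vote over the SMILES strings each engine returns and flags a structure for human review when no two engines agree. The function name `consensus_smiles` and the dictionary input are hypothetical and are not part of MARCUS or of any engine's API.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def consensus_smiles(predictions: Dict[str, Optional[str]]) -> Tuple[Optional[str], bool]:
    """Return (smiles, needs_review) for one structure image.

    `predictions` maps an engine name to the SMILES string it produced,
    or to None/"" if the engine failed. This is an assumed voting scheme,
    not the method MARCUS is documented to use.
    """
    # Keep only non-empty predictions, ignoring surrounding whitespace.
    valid = [s.strip() for s in predictions.values() if s and s.strip()]
    if not valid:
        return None, True  # nothing recognized: a curator must redraw it

    # Count identical strings; ties resolve to the first string seen.
    best, votes = Counter(valid).most_common(1)[0]

    # Fewer than two agreeing engines means low confidence.
    return best, votes < 2


if __name__ == "__main__":
    example = {"DECIMER": "CCO", "MolNexTR": "CCO", "MolScribe": "OCC"}
    print(consensus_smiles(example))
```

The vote compares raw strings, so `CCO` and `OCC` count as different answers even though both describe ethanol. A practical version would canonicalize each SMILES with a cheminformatics toolkit before counting.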
The interface combines document upload, text annotation, structure segmentation, and direct submission to the COCONUT natural products database, which speeds up a curation process that has traditionally been manual and error-prone. The architecture pairs a user-friendly frontend with containerized backend services and supports concurrent users with stable performance, which matters for broader adoption by researchers. The platform also follows open science ideals: its code is open source, its models are accessible, and its documentation is extensive. This openness encourages community-driven development and invites collaboration that could extend MARCUS beyond natural products. Expert human oversight remains essential for validation, but MARCUS reduces the load on curators by converting about two-thirds of structure images accurately on the first pass, a notable result given the complexity of chemical imagery. By connecting the rapidly growing chemical literature to the structured data that AI-driven discovery workflows require, MARCUS promotes FAIR data principles and helps researchers use the literature more efficiently.