BELLADONNA

Breast Expert-Led LLM Agent for Data Organization, Normalization & Narrative Aggregation

An autonomous, modular AI agentic pipeline to aggregate and structure the entirety of publicly available breast cancer knowledge — and rigorously benchmark how well today's LLMs can answer breast cancer questions.

Breast Cancer Agentic AI LLM Benchmarking Knowledge Aggregation Clinical Decision Support Explainability

Why BELLADONNA?

Clinicians, patients, and caregivers are increasingly turning to unregulated large language models for guidance — yet the accuracy and safety of these tools for breast cancer remain unknown. BELLADONNA addresses this critical gap.

The challenge

The oncology community faces information overload: it is increasingly difficult to collect trustworthy, vetted information about breast cancer. Knowledge is scattered across thousands of clinical guidelines, research papers, trial databases, and conference abstracts.

At the same time, patients and clinicians are using tools like ChatGPT for medical guidance without knowing how reliable these models really are. We fundamentally don't know how accurate LLMs are in the field of breast cancer — creating uncertainty about which uses should be encouraged or discouraged.

BELLADONNA tackles both problems: we use LLMs themselves, in expert-supervised agentic workflows, to parse existing knowledge and subsequently benchmark model performance. The result is an open, transparent infrastructure for AI-expert collaboration in breast cancer knowledge synthesis.

Research Question 1

Can an autonomous, modular AI agentic pipeline reliably aggregate and structure the entirety of publicly available breast cancer knowledge into a database of discrete “knowledge units”?

Research Question 2

Using a benchmark of tiered clinical questions derived from that database, how accurately do current proprietary and open-source LLMs perform — and which question features most strongly predict errors?

Three interconnected workstreams

BELLADONNA spans knowledge aggregation, benchmark creation, and LLM evaluation — together creating a rigorous, end-to-end framework for assessing AI in breast oncology.

📚

Knowledge Aggregation

An agentic AI pipeline ingests clinical guidelines (NCCN, ESMO), PubMed Central articles, and AACR/ASCO abstracts. Open-source LLMs extract factoids that are deduplicated, mapped to biomedical ontologies, and stored as discrete, domain-tagged knowledge units in a relational database.

🧠

Benchmark Creation

From the validated knowledge base, question–answer pairs are generated using Bloom-taxonomy prompts across difficulty tiers, including safety-critical “red-flag” items. An interdisciplinary expert panel refines questions via a Delphi process, with a separate expert-only hold-out set.

🤖

LLM Evaluation

Both proprietary and open-source LLMs are evaluated on accuracy, reproducibility, and model-reported certainty. Statistical analyses link question features to error odds, and explainability visualizations reveal each model's reasoning.

How the agentic pipeline works

A modular pipeline moving from raw public knowledge sources to a validated, structured database of breast cancer facts.

1

Retrieval & Preprocessing

Clinical guidelines, PubMed Central articles, and conference abstracts are ingested, tokenized, and prepared for structured extraction via NER and sentence splitting.
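The sentence-splitting part of this step can be sketched as follows. This is a minimal illustration using a naive regex splitter; the actual pipeline would use a clinical NLP tokenizer, and the function name and example text are hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on ., ?, or ! followed by
    whitespace and an uppercase letter. A production pipeline would
    use a dedicated clinical NLP tokenizer instead."""
    parts = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

abstract = (
    "Trastuzumab improves outcomes in HER2-positive disease. "
    "Adverse events were manageable."
)
sentences = split_sentences(abstract)
```

Each resulting sentence can then be passed to NER and the downstream extraction stage independently.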

2

Extraction & Deduplication

An ensemble of open-source LLMs proposes factoids from preprocessed content. A fuzzy-matching layer deduplicates and consolidates information from overlapping sources.
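The deduplication idea can be illustrated with character-level similarity from Python's standard library. This is a sketch, not the project's actual matcher: the threshold value and example factoids are assumptions for illustration.

```python
from difflib import SequenceMatcher

def dedup_factoids(factoids: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a factoid only if it is not near-identical (by similarity
    ratio) to one already retained; near-duplicates are consolidated
    into the first-seen variant."""
    kept: list[str] = []
    for f in factoids:
        if all(SequenceMatcher(None, f.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(f)
    return kept

candidates = [
    "Tamoxifen reduces recurrence in ER-positive breast cancer.",
    "Tamoxifen reduces recurrence in ER-positive breast cancer",
    "BRCA1 mutations raise lifetime breast cancer risk.",
]
unique = dedup_factoids(candidates)
```

In practice a semantic (embedding-based) comparison would catch paraphrases that a character-level ratio misses.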

3

Schema Mapping

Each fact is aligned to a common biomedical ontology (e.g., OMOP) and stored as a discrete, domain-tagged knowledge unit in a relational database.
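A minimal relational sketch of such a knowledge-unit table is shown below. The column names and the example concept ID are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical minimal schema for domain-tagged knowledge units.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE knowledge_unit (
        id          INTEGER PRIMARY KEY,
        statement   TEXT NOT NULL,
        domain      TEXT NOT NULL,   -- e.g. 'treatment', 'diagnostics'
        concept_id  INTEGER,         -- mapped ontology concept (e.g. OMOP)
        source      TEXT NOT NULL    -- guideline, article, or abstract ID
    )
""")
conn.execute(
    "INSERT INTO knowledge_unit (statement, domain, concept_id, source) "
    "VALUES (?, ?, ?, ?)",
    ("Tamoxifen is indicated in ER-positive disease.",
     "treatment", 12345, "ESMO-guideline"),
)
rows = conn.execute(
    "SELECT domain, COUNT(*) FROM knowledge_unit GROUP BY domain"
).fetchall()
```

Keeping the ontology concept ID alongside the free-text statement lets downstream steps query units both by domain tag and by standardized concept.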

4

Expert Validation

Independent experts validate knowledge units via a web interface. Precision and recall are logged per category, driving iterative refinement of the extraction pipeline.
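The per-category precision/recall logging can be sketched like this. The verdict labels and example categories are assumptions for illustration.

```python
from collections import defaultdict

def per_category_metrics(reviews):
    """reviews: (category, verdict) pairs, where verdict is 'tp'
    (unit confirmed correct), 'fp' (unit rejected), or 'fn' (a fact
    the expert reports as missing). Returns precision and recall
    per category."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for category, verdict in reviews:
        counts[category][verdict] += 1
    metrics = {}
    for category, c in counts.items():
        p_denom = c["tp"] + c["fp"]
        r_denom = c["tp"] + c["fn"]
        metrics[category] = {
            "precision": c["tp"] / p_denom if p_denom else 0.0,
            "recall": c["tp"] / r_denom if r_denom else 0.0,
        }
    return metrics

reviews = [("treatment", "tp"), ("treatment", "tp"), ("treatment", "fp"),
           ("diagnostics", "tp"), ("diagnostics", "fn")]
metrics = per_category_metrics(reviews)
```

Categories with low precision point to extraction prompts that need tightening; low recall points to sources the retrieval stage is missing.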

Rigorous LLM evaluation

A tiered benchmark gives an immediate, quantifiable measure of how much we can trust today's LLMs to answer breast cancer questions — and identifies their failure modes.

Benchmark Design

An interdisciplinary panel of oncologists, pathologists, and patient advocates curates tiered question–answer sets via a Delphi consensus process.

  • Bloom-taxonomy difficulty tiers: simple, moderate, hard, red-flag
  • Safety-critical questions targeting high-risk failure modes
  • Separate expert-only hold-out set to prevent circularity
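A benchmark item carrying these design attributes might be structured as below. This is a sketch under stated assumptions: the field names, tier labels, and example question are hypothetical.

```python
from dataclasses import dataclass

TIERS = ("simple", "moderate", "hard", "red-flag")

@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    answer: str
    tier: str            # Bloom-taxonomy difficulty tier
    holdout: bool = False  # True for the expert-only hold-out set

    def __post_init__(self):
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")

item = BenchmarkItem(
    question="Which receptor status makes a tumor eligible for trastuzumab?",
    answer="HER2-positive",
    tier="simple",
)
```

Making items immutable and validating the tier at construction keeps the benchmark file format consistent across Delphi revision rounds.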

Model Evaluation & Explainability

Proprietary and open-source LLMs are evaluated on accuracy, reproducibility, and certainty, accompanied by explainability visualizations via an interactive web portal.

  • Statistical analyses linking question features to error patterns
  • Word-level feature importance and attention-rollout maps
  • All code, data, and benchmarks released under open-source license
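The "question features to error patterns" analysis can be illustrated with a simple odds ratio between two question groups. The counts below are hypothetical, and the continuity correction is one common convention, not necessarily the project's chosen method.

```python
def error_odds_ratio(errors_a: int, n_a: int,
                     errors_b: int, n_b: int) -> float:
    """Odds ratio of error for question group A vs group B
    (e.g. red-flag vs simple tier), with a 0.5 continuity
    correction to avoid division by zero."""
    a, b = errors_a + 0.5, (n_a - errors_a) + 0.5
    c, d = errors_b + 0.5, (n_b - errors_b) + 0.5
    return (a / b) / (c / d)

# Hypothetical counts: 30/100 errors on red-flag items vs 10/100 on simple.
or_redflag = error_odds_ratio(30, 100, 10, 100)
```

In the full analysis, a multivariable logistic regression over several question features at once would replace this pairwise comparison.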

A project by Kather Lab

BELLADONNA is developed at the Kather Lab.

Supporting publications

Key publications from the Kather Lab that form the scientific foundation for BELLADONNA.

  1. Boehm KM, El Nahhas OSM, Marra A, et al. Multimodal histopathologic models stratify hormone receptor-positive early breast cancer. Nat Commun. 2025;16:2106.
  2. Ferber D, Wiest IC, Wölflein G, et al. GPT-4 for Information Retrieval and Comparison of Medical Oncology Guidelines. NEJM AI. 2024;1:AIcs2300235.
  3. Lee Y, Ferber D, Rood JE, Regev A, Kather JN. How AI agents will change cancer research and oncology. Nat Cancer. 2024;5:1765–1767.
  4. Ferber D, El Nahhas OSM, Wölflein G, et al. Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. arXiv [cs.AI]. 2024.

BELLADONNA Modules

Interactive tools and resources from the BELLADONNA project. Some modules require authorized access.