Breast Expert-Led LLM Agent for Data Organization, Normalization & Narrative Aggregation
An agentic pipeline that aggregates publicly available breast cancer knowledge from clinical guidelines, PubMed Central literature, trial registries and conference abstracts into a structured factoid database, and benchmarks large language models against expert-curated questions derived from that corpus.
A research project of Kather Lab, TU Dresden, funded by the Breast Cancer Research Foundation.
Clinicians, patients and caregivers increasingly consult general-purpose large language models about breast cancer. The accuracy and safety of these models in this clinical domain remain largely unmeasured.
Reliable breast cancer knowledge is distributed across thousands of clinical guidelines, peer-reviewed publications, trial registry records and conference abstracts. Keeping a unified picture current by manual curation is not feasible at this scale.
At the same time, patients and clinicians query tools such as ChatGPT without knowing in which areas the answers are reliable, in which they are wrong, and in which they are wrong in a way that is clinically dangerous.
BELLADONNA applies large language models in expert-supervised agentic workflows: first to parse existing breast cancer knowledge into a structured factoid database, then to derive a tiered question set used to measure how well other models answer that material. All code, data and benchmarks are released under an open-source license.
Can an agentic pipeline reliably aggregate publicly available breast cancer knowledge into discrete, schema-normalised factoids?
How accurately do proprietary and open-source large language models answer breast cancer questions derived from that knowledge base, and which question features predict errors?
The project has three stages: knowledge aggregation, benchmark creation and model evaluation. Each stage feeds the next; together they yield a reproducible measurement of large language model performance in breast oncology.
An agentic pipeline ingests NCCN and ESMO guidelines, PubMed Central full-text articles, and AACR / ASCO conference abstracts. Open-source large language models extract candidate factoids, which are then deduplicated, mapped to biomedical ontologies, and stored as discrete domain-tagged records in a relational database.
Question and answer pairs are generated from the validated knowledge base using Bloom-taxonomy prompts across four difficulty tiers, including safety-critical “red-flag” items. An interdisciplinary expert panel refines the questions through a Delphi consensus process; a separate expert-only hold-out set prevents circularity.
Proprietary and open-source large language models are scored on accuracy, reproducibility, and self-reported certainty. Statistical analyses link question features to error odds, and explainability visualisations expose the reasoning each model used to arrive at its answers.
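The link between a question feature and error odds can be illustrated with a plain odds ratio. This is a minimal sketch, not the project's analysis: the source does not name the statistical model, and the record layout, the `error` field and the `red_flag` feature name are assumptions made for the example. A full analysis would more likely use a regression over several features at once.

```python
def error_odds_ratio(records, feature):
    """Odds of a wrong answer when `feature` is present vs. absent.

    `records` is a list of dicts holding a boolean `feature` field and a
    boolean "error" field (hypothetical layout, chosen for illustration).
    """
    a = sum(1 for r in records if r[feature] and r["error"])          # feature, wrong
    b = sum(1 for r in records if r[feature] and not r["error"])      # feature, right
    c = sum(1 for r in records if not r[feature] and r["error"])      # no feature, wrong
    d = sum(1 for r in records if not r[feature] and not r["error"])  # no feature, right
    # Haldane-Anscombe correction: add 0.5 to every cell if any cell is empty,
    # so the ratio stays finite on small samples.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


# Toy data: 10 red-flag questions (6 wrong), 10 other questions (2 wrong).
records = (
    [{"red_flag": True, "error": True}] * 6
    + [{"red_flag": True, "error": False}] * 4
    + [{"red_flag": False, "error": True}] * 2
    + [{"red_flag": False, "error": False}] * 8
)
ratio = error_odds_ratio(records, "red_flag")
```

A ratio above 1 means the feature is associated with more errors; on the toy data the odds of an error are six times higher for the flagged questions.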
Four modular stages take raw public sources to a validated, structured database of breast cancer factoids.
Guidelines, PubMed Central articles and conference abstracts are downloaded, tokenised and prepared for structured extraction through named-entity recognition, sentence splitting, and section-aware chunking.
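Section-aware chunking can be sketched as follows. The idea is that a chunk never crosses a section boundary, so each chunk keeps the label of the section it came from. The heading list, the regex sentence splitter and the `max_sentences` parameter are assumptions for illustration, not the project's preprocessing code, and named-entity recognition is left out.

```python
import re

# Hypothetical heading detector; real sources would need a richer one.
HEADING = re.compile(
    r"^(abstract|introduction|methods|results|discussion|conclusions?)$",
    re.IGNORECASE,
)
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_section(text, max_sentences=2):
    """Split text into chunks of at most `max_sentences` sentences,
    never mixing sentences from two different sections."""
    chunks = []
    section = "preamble"
    sentences = []

    def flush():
        # Emit the buffered sentences under the section they belong to.
        for i in range(0, len(sentences), max_sentences):
            chunks.append(
                {"section": section, "text": " ".join(sentences[i:i + max_sentences])}
            )
        sentences.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if HEADING.match(line):
            flush()                 # close the previous section first
            section = line.lower()
        else:
            sentences.extend(s for s in SENTENCE_END.split(line) if s)
    flush()
    return chunks


sample = (
    "Methods\n"
    "We enrolled patients. Tumours were graded. Follow-up was five years.\n"
    "Results\n"
    "Survival improved."
)
chunks = chunk_by_section(sample)
```

Keeping the section label on every chunk lets a later extraction step treat a statement from a Methods section differently from one in Results.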
An ensemble of open-source large language models proposes factoids from the preprocessed content. A fuzzy-match layer deduplicates and consolidates statements made by overlapping sources.
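A fuzzy-match deduplication layer of this kind might look like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the source does not say which matcher the pipeline uses, and the 0.9 threshold and the `(text, source)` tuple layout are assumptions.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def deduplicate(factoids, threshold=0.9):
    """Greedy clustering of near-identical statements.

    `factoids` is an iterable of (text, source) pairs. Each statement joins
    the first existing cluster whose representative is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []
    for text, source in factoids:
        key = normalise(text)
        for cluster in clusters:
            if SequenceMatcher(None, key, cluster["key"]).ratio() >= threshold:
                cluster["sources"].add(source)  # consolidate overlapping sources
                break
        else:
            clusters.append({"key": key, "text": text, "sources": {source}})
    return clusters


candidates = [
    ("Tamoxifen reduces recurrence in ER-positive breast cancer.", "guideline"),
    ("Tamoxifen reduces recurrence in ER positive breast cancer.", "pmc"),
    ("HER2 amplification predicts response to trastuzumab.", "abstract"),
]
clusters = deduplicate(candidates)
```

The greedy loop compares every new statement with every cluster, which is quadratic in the worst case. At the scale of millions of factoids, a blocking or embedding-based candidate search would be needed before any pairwise comparison.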
Each factoid is mapped to a common biomedical ontology (for example OMOP) and stored as a discrete domain-tagged knowledge unit in a relational database.
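Storage of domain-tagged knowledge units in a relational database can be sketched with `sqlite3` from the Python standard library. The table name, the column names and the concept code shown are hypothetical; the project's actual schema and ontology identifiers are not given in the source.

```python
import sqlite3


def create_store(conn):
    """Create a minimal factoid table (hypothetical schema)."""
    conn.executescript(
        """
        CREATE TABLE factoid (
            id           INTEGER PRIMARY KEY,
            statement    TEXT NOT NULL,
            domain       TEXT NOT NULL,   -- domain tag, e.g. 'systemic therapy'
            concept_code TEXT,            -- ontology mapping, may be missing
            source       TEXT NOT NULL
        );
        CREATE INDEX idx_factoid_domain ON factoid(domain);
        """
    )


def add_factoid(conn, statement, domain, concept_code, source):
    """Insert one knowledge unit and return its row id."""
    cur = conn.execute(
        "INSERT INTO factoid (statement, domain, concept_code, source) "
        "VALUES (?, ?, ?, ?)",
        (statement, domain, concept_code, source),
    )
    return cur.lastrowid


def statements_in_domain(conn, domain):
    """Return all statements carrying the given domain tag."""
    rows = conn.execute(
        "SELECT statement FROM factoid WHERE domain = ? ORDER BY id", (domain,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_store(conn)
# "EXAMPLE:0001" is a placeholder, not a real ontology identifier.
add_factoid(conn, "Tamoxifen is an endocrine therapy.", "systemic therapy",
            "EXAMPLE:0001", "guideline")
add_factoid(conn, "Mammography is a screening modality.", "screening",
            None, "pmc")
therapy = statements_in_domain(conn, "systemic therapy")
```

Keeping one statement per row with an explicit domain column is what makes per-topic validation and per-domain benchmark sampling simple queries rather than text searches.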
Independent clinical experts validate knowledge units via a web interface. Precision and recall are logged per topic, and these scores drive iterative refinement of the extraction pipeline.
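Per-topic precision and recall can be computed from expert verdicts as in the sketch below. The verdict labels are assumptions made for the example: `tp` for an extracted factoid the expert confirms, `fp` for an extracted factoid the expert rejects, and `fn` for a fact the expert says the pipeline missed.

```python
from collections import defaultdict


def precision_recall_by_topic(judgements):
    """Aggregate (topic, verdict) pairs into precision and recall per topic.

    precision = tp / (tp + fp): share of extracted factoids that are correct
    recall    = tp / (tp + fn): share of expected facts that were extracted
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for topic, verdict in judgements:
        counts[topic][verdict] += 1

    scores = {}
    for topic, c in counts.items():
        extracted = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        scores[topic] = {
            # None marks a score that is undefined for lack of data.
            "precision": c["tp"] / extracted if extracted else None,
            "recall": c["tp"] / expected if expected else None,
        }
    return scores


judgements = (
    [("screening", "tp")] * 3
    + [("screening", "fp")]
    + [("screening", "fn")] * 2
    + [("surgery", "fn")]
)
scores = precision_recall_by_topic(judgements)
```

Topics with low precision point to over-eager extraction, while topics with low recall point to sources or sections the pipeline is not reading well.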
A tiered benchmark quantifies where current large language models are reliable on breast cancer questions, and where they are not.
An interdisciplinary panel of medical oncologists, surgeons, radiation oncologists, pathologists and patient advocates curates the question and answer sets through a Delphi consensus process.
Models are scored on accuracy, reproducibility, and self-reported certainty. Results are presented through an interactive web portal.
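The three scores could be computed along the following lines. The definitions are assumptions, since the source does not spell them out: accuracy is taken as the share of questions whose majority answer over repeated runs matches the key, reproducibility as the share of questions answered identically in every run, and certainty as the mean of the model's self-reported confidence values.

```python
from collections import Counter


def score_model(runs, answer_key):
    """Score one model.

    `runs` maps a question id to a list of (answer, certainty) pairs from
    repeated queries; `answer_key` maps a question id to the correct answer.
    Both layouts are hypothetical.
    """
    correct = 0
    stable = 0
    certainties = []
    for qid, attempts in runs.items():
        answers = [answer for answer, _ in attempts]
        majority, _ = Counter(answers).most_common(1)[0]
        correct += majority == answer_key[qid]   # majority vote is right
        stable += len(set(answers)) == 1         # every run gave the same answer
        certainties.extend(certainty for _, certainty in attempts)
    n = len(runs)
    return {
        "accuracy": correct / n,
        "reproducibility": stable / n,
        "mean_certainty": sum(certainties) / len(certainties),
    }


runs = {
    "q1": [("A", 0.9), ("A", 0.9), ("A", 0.9)],
    "q2": [("B", 0.5), ("C", 0.5), ("B", 0.5)],
}
answer_key = {"q1": "A", "q2": "C"}
result = score_model(runs, answer_key)
```

Comparing mean certainty with accuracy gives a first, coarse view of over-confidence: in the toy data the model reports 0.7 certainty on average while answering half of the questions correctly.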
BELLADONNA is developed at the Kather Lab, Else Kröner Fresenius Center for Digital Health, TU Dresden, in collaboration with the University Hospital Carl Gustav Carus Dresden and the National Center for Tumor Diseases Dresden. Clinical collaborators contribute expert validation and the Delphi consensus process.
Prof. Dr. Jakob Nikolas Kather
Professor of Clinical Artificial Intelligence,
Else Kröner Fresenius Center for Digital Health,
Faculty of Medicine, TU Dresden, Germany
Joint appointments at the University Hospital Carl Gustav Carus Dresden and the National Center for Tumor Diseases Dresden.
BELLADONNA is funded by the Breast Cancer Research Foundation (BCRF), a non-profit founded in 1993 by Evelyn H. Lauder and based in New York City. BCRF supports investigators directly rather than projects, with grants vetted by its Scientific Advisory Board and renewed on the basis of demonstrated progress.
In the 2025–26 cycle, BCRF awarded approximately USD 74.75 million in grants to more than 260 investigators worldwide. Since its founding the Foundation has raised over one billion dollars for breast cancer research and holds the highest ratings from Charity Navigator, CharityWatch and Candid.
Key publications from the Kather Lab that form the scientific foundation for BELLADONNA.
Interactive tools and resources from the project. Some modules are in development; the Knowledge Atlas and the Corpus Explorer are live.
200 expert-authored breast cancer questions. Take the quiz yourself and compare your performance against leading LLMs.
Large-scale question bank generated by the agentic pipeline from clinical guidelines, literature, and trial data.
Side-by-side comparison of LLM accuracy across difficulty tiers, domains, and question types on the breast cancer benchmark.
UMAP projection of all 174,379 source documents, with positions computed from a sentence-transformer embedding. Colour by source, topic cluster, or year; filter, search by title, and click any point for details.
Per-source statistics, full-text document browser, factoid sampler, source-to-topic-to-entity Sankey, and a hierarchical tree view of the 4.56 million extracted factoids.
Evidence-based educational materials in plain language, co-developed with patient advocates.
Data sources, extraction methodology, ontology mappings, evaluation protocols, and API reference.