BELLADONNA

Breast Expert-Led LLM Agent for Data Organization, Normalization & Narrative Aggregation

An autonomous, modular AI agentic pipeline to aggregate and structure the entirety of publicly available breast cancer knowledge — and rigorously benchmark how well today's LLMs can answer breast cancer questions.

Breast Cancer Agentic AI LLM Benchmarking Knowledge Aggregation Clinical Decision Support Explainability

Why BELLADONNA?

Clinicians, patients, and caregivers are increasingly turning to unregulated large language models for guidance — yet the accuracy and safety of these tools for breast cancer remain unknown. BELLADONNA addresses this critical gap.

The challenge

The oncology community faces information overload: it is increasingly difficult to collect trustworthy, vetted information about breast cancer. Knowledge is scattered across thousands of clinical guidelines, research papers, trial databases, and conference abstracts.

At the same time, patients and clinicians are using tools like ChatGPT for medical guidance without knowing how reliable these models really are. We fundamentally don't know how accurate LLMs are in the field of breast cancer — creating uncertainty about which uses should be encouraged or discouraged.

BELLADONNA tackles both problems: we use LLMs themselves, in expert-supervised agentic workflows, to parse existing knowledge and subsequently benchmark model performance. The result is an open, transparent infrastructure for AI-expert collaboration in breast cancer knowledge synthesis.

Research Question 1

Can an autonomous, modular AI agentic pipeline reliably aggregate and structure the entirety of publicly available breast cancer knowledge into a database of discrete “knowledge units”?

Research Question 2

Using a benchmark of tiered clinical questions derived from that database, how accurately do current proprietary and open-source LLMs perform — and which question features most strongly predict errors?

Three interconnected workstreams

BELLADONNA spans knowledge aggregation, benchmark creation, and LLM evaluation — together creating a rigorous, end-to-end framework for assessing AI in breast oncology.

📚

Knowledge Aggregation

An agentic AI pipeline ingests clinical guidelines (NCCN, ESMO), PubMed Central articles, and AACR/ASCO abstracts. Open-source LLMs extract factoids that are deduplicated, mapped to biomedical ontologies, and stored as discrete, domain-tagged knowledge units in a relational database.

🧠

Benchmark Creation

From the validated knowledge base, question–answer pairs are generated using Bloom-taxonomy prompts across difficulty tiers, including safety-critical “red-flag” items. An interdisciplinary expert panel refines questions via a Delphi process, with a separate expert-only hold-out set.

🤖

LLM Evaluation

Both proprietary and open-source LLMs are evaluated on accuracy, reproducibility, and model-reported certainty. Statistical analyses link question features to error odds, and explainability visualizations reveal each model's reasoning.

How the agentic pipeline works

A modular pipeline moving from raw public knowledge sources to a validated, structured database of breast cancer facts.

1

Retrieval & Preprocessing

Clinical guidelines, PubMed Central articles, and conference abstracts are ingested, tokenized, and prepared for structured extraction via NER and sentence splitting.
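The sentence-splitting part of this step can be sketched as follows. This is a minimal illustration using a naive regex splitter; the actual pipeline would use a clinical NLP tokenizer, and the function name and example text are hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on ., ?, or ! followed by
    whitespace and an uppercase letter. A production pipeline would
    use a dedicated clinical NLP tokenizer instead."""
    parts = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

abstract = (
    "Trastuzumab improves outcomes in HER2-positive disease. "
    "Adverse events were manageable."
)
sentences = split_sentences(abstract)
```

Each resulting sentence can then be passed to NER and the downstream extraction stage independently.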

2

Extraction & Deduplication

An ensemble of open-source LLMs proposes factoids from preprocessed content. A fuzzy-matching layer deduplicates and consolidates information from overlapping sources.
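The deduplication idea can be illustrated with character-level similarity from Python's standard library. This is a sketch, not the project's actual matcher: the threshold value and example factoids are assumptions for illustration.

```python
from difflib import SequenceMatcher

def dedup_factoids(factoids: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a factoid only if it is not near-identical (by similarity
    ratio) to one already retained; near-duplicates are consolidated
    into the first-seen variant."""
    kept: list[str] = []
    for f in factoids:
        if all(SequenceMatcher(None, f.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(f)
    return kept

candidates = [
    "Tamoxifen reduces recurrence in ER-positive breast cancer.",
    "Tamoxifen reduces recurrence in ER-positive breast cancer",
    "BRCA1 mutations raise lifetime breast cancer risk.",
]
unique = dedup_factoids(candidates)
```

In practice a semantic (embedding-based) comparison would catch paraphrases that a character-level ratio misses.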

3

Schema Mapping

Each fact is aligned to a common biomedical ontology (e.g., OMOP) and stored as a discrete, domain-tagged knowledge unit in a relational database.
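A minimal relational sketch of such a knowledge-unit table is shown below. The column names and the example concept ID are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical minimal schema for domain-tagged knowledge units.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE knowledge_unit (
        id          INTEGER PRIMARY KEY,
        statement   TEXT NOT NULL,
        domain      TEXT NOT NULL,   -- e.g. 'treatment', 'diagnostics'
        concept_id  INTEGER,         -- mapped ontology concept (e.g. OMOP)
        source      TEXT NOT NULL    -- guideline, article, or abstract ID
    )
""")
conn.execute(
    "INSERT INTO knowledge_unit (statement, domain, concept_id, source) "
    "VALUES (?, ?, ?, ?)",
    ("Tamoxifen is indicated in ER-positive disease.",
     "treatment", 12345, "ESMO-guideline"),
)
rows = conn.execute(
    "SELECT domain, COUNT(*) FROM knowledge_unit GROUP BY domain"
).fetchall()
```

Keeping the ontology concept ID alongside the free-text statement lets downstream steps query units both by domain tag and by standardized concept.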

4

Expert Validation

Independent experts validate knowledge units via a web interface. Precision and recall are logged per category, driving iterative refinement of the extraction pipeline.
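The per-category precision/recall logging can be sketched like this. The verdict labels and example categories are assumptions for illustration.

```python
from collections import defaultdict

def per_category_metrics(reviews):
    """reviews: (category, verdict) pairs, where verdict is 'tp'
    (unit confirmed correct), 'fp' (unit rejected), or 'fn' (a fact
    the expert reports as missing). Returns precision and recall
    per category."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for category, verdict in reviews:
        counts[category][verdict] += 1
    metrics = {}
    for category, c in counts.items():
        p_denom = c["tp"] + c["fp"]
        r_denom = c["tp"] + c["fn"]
        metrics[category] = {
            "precision": c["tp"] / p_denom if p_denom else 0.0,
            "recall": c["tp"] / r_denom if r_denom else 0.0,
        }
    return metrics

reviews = [("treatment", "tp"), ("treatment", "tp"), ("treatment", "fp"),
           ("diagnostics", "tp"), ("diagnostics", "fn")]
metrics = per_category_metrics(reviews)
```

Categories with low precision point to extraction prompts that need tightening; low recall points to sources the retrieval stage is missing.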

Rigorous LLM evaluation

A tiered benchmark gives an immediate, quantifiable measure of how much we can trust today's LLMs to answer breast cancer questions — and identifies their failure modes.

Benchmark Design

An interdisciplinary panel of oncologists, pathologists, and patient advocates curates tiered question–answer sets via a Delphi consensus process.

  • Bloom-taxonomy difficulty tiers: simple, moderate, hard, red-flag
  • Safety-critical questions targeting high-risk failure modes
  • Separate expert-only hold-out set to prevent circularity
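A benchmark item carrying these design attributes might be structured as below. This is a sketch under stated assumptions: the field names, tier labels, and example question are hypothetical.

```python
from dataclasses import dataclass

TIERS = ("simple", "moderate", "hard", "red-flag")

@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    answer: str
    tier: str            # Bloom-taxonomy difficulty tier
    holdout: bool = False  # True for the expert-only hold-out set

    def __post_init__(self):
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")

item = BenchmarkItem(
    question="Which receptor status makes a tumor eligible for trastuzumab?",
    answer="HER2-positive",
    tier="simple",
)
```

Making items immutable and validating the tier at construction keeps the benchmark file format consistent across Delphi revision rounds.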

Model Evaluation & Explainability

Proprietary and open-source LLMs are evaluated on accuracy, reproducibility, and certainty, accompanied by explainability visualizations via an interactive web portal.

  • Statistical analyses linking question features to error patterns
  • Word-level feature importance and attention-rollout maps
  • All code, data, and benchmarks released under open-source license
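The "question features to error patterns" analysis can be illustrated with a simple odds ratio between two question groups. The counts below are hypothetical, and the continuity correction is one common convention, not necessarily the project's chosen method.

```python
def error_odds_ratio(errors_a: int, n_a: int,
                     errors_b: int, n_b: int) -> float:
    """Odds ratio of error for question group A vs group B
    (e.g. red-flag vs simple tier), with a 0.5 continuity
    correction to avoid division by zero."""
    a, b = errors_a + 0.5, (n_a - errors_a) + 0.5
    c, d = errors_b + 0.5, (n_b - errors_b) + 0.5
    return (a / b) / (c / d)

# Hypothetical counts: 30/100 errors on red-flag items vs 10/100 on simple.
or_redflag = error_odds_ratio(30, 100, 10, 100)
```

In the full analysis, a multivariable logistic regression over several question features at once would replace this pairwise comparison.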

A project by Kather Lab

BELLADONNA is developed at the Kather Lab.

Supporting publications

Key publications from the Kather Lab that form the scientific foundation for BELLADONNA.

  1. Boehm KM, El Nahhas OSM, Marra A, et al. Multimodal histopathologic models stratify hormone receptor-positive early breast cancer. Nat Commun. 2025;16:2106.
  2. Ferber D, Wiest IC, Wölflein G, et al. GPT-4 for Information Retrieval and Comparison of Medical Oncology Guidelines. NEJM AI. 2024;1:AIcs2300235.
  3. Lee Y, Ferber D, Rood JE, Regev A, Kather JN. How AI agents will change cancer research and oncology. Nat Cancer. 2024;5:1765–1767.
  4. Ferber D, El Nahhas OSM, Wölflein G, et al. Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. arXiv [cs.AI]. 2024.

BELLADONNA Modules

Interactive tools and resources from the BELLADONNA project. Some modules require authorized access.