Breast Expert-Led LLM Agent for Data Organization, Normalization & Narrative Aggregation
An autonomous, modular AI agentic pipeline to aggregate and structure the entirety of publicly available breast cancer knowledge — and rigorously benchmark how well today's LLMs can answer breast cancer questions.
Clinicians, patients, and caregivers are increasingly turning to unregulated large language models for guidance — yet the accuracy and safety of these tools for breast cancer remain unknown. BELLADONNA addresses this critical gap.
The oncology community faces information overload: it is increasingly difficult to collect trustworthy, vetted information about breast cancer. Knowledge is scattered across thousands of clinical guidelines, research papers, trial databases, and conference abstracts.
At the same time, patients and clinicians are using tools like ChatGPT for medical guidance, yet how accurately LLMs answer breast cancer questions has never been measured, creating uncertainty about which uses should be encouraged or discouraged.
BELLADONNA tackles both problems: we use LLMs themselves, in expert-supervised agentic workflows, to parse existing knowledge and subsequently benchmark model performance. The result is an open, transparent infrastructure for AI-expert collaboration in breast cancer knowledge synthesis.
Can an autonomous, modular AI agentic pipeline reliably aggregate and structure the entirety of publicly available breast cancer knowledge into a database of discrete “knowledge units”?
Using a benchmark of tiered clinical questions derived from that database, how accurately do current proprietary and open-source LLMs perform — and which question features most strongly predict errors?
BELLADONNA spans knowledge aggregation, benchmark creation, and LLM evaluation — together creating a rigorous, end-to-end framework for assessing AI in breast oncology.
An agentic AI pipeline ingests clinical guidelines (NCCN, ESMO), PubMed Central articles, and AACR/ASCO abstracts. Open-source LLMs extract factoids that are deduplicated, mapped to biomedical ontologies, and stored as discrete, domain-tagged knowledge units in a relational database.
From the validated knowledge base, question–answer pairs are generated using Bloom-taxonomy prompts across difficulty tiers, including safety-critical “red-flag” items. An interdisciplinary expert panel refines questions via a Delphi process, with a separate expert-only hold-out set.
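To illustrate how Bloom-taxonomy prompts could turn a knowledge unit into tiered questions, here is a minimal sketch. The tier names, instructions, and the `build_prompt` helper are illustrative assumptions, not the project's actual prompt templates.

```python
# Illustrative Bloom-taxonomy prompt templates (assumed, not the
# project's real templates): one instruction per difficulty tier.
BLOOM_TIERS = {
    "remember": "State the single fact the unit asserts as a direct question.",
    "apply": "Write a short clinical vignette whose correct answer requires this fact.",
    "evaluate": "Ask which of several management options is best, justified by this fact.",
}

def build_prompt(knowledge_unit: str, tier: str) -> str:
    """Assemble a question-generation prompt for one knowledge unit and tier."""
    return (
        "You are writing a breast cancer benchmark question.\n"
        f"Knowledge unit: {knowledge_unit}\n"
        f"Task ({tier}): {BLOOM_TIERS[tier]}\n"
        "Return the question, four answer options, and the correct letter."
    )
```

Each generated item would then pass through the Delphi expert review described above before entering the benchmark.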
Both proprietary and open-source LLMs are evaluated on accuracy, reproducibility, and model-reported certainty. Statistical analyses link question features to error odds, and explainability visualizations reveal each model's reasoning.
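As a sketch of how a question feature can be linked to error odds, the snippet below computes a crude unadjusted odds ratio from per-question results; the project's actual analysis may instead use a multivariable model such as logistic regression, and the `red_flag`/`error` field names are assumptions.

```python
def error_odds_ratio(results: list[dict], feature: str) -> float:
    """Unadjusted odds ratio: odds of an LLM error on questions with a
    given feature vs. without it. Each result is a dict of booleans,
    e.g. {"red_flag": True, "error": False}. A 0.5 (Haldane) correction
    is added to every cell to avoid division by zero on small samples.
    """
    # 2x2 contingency table keyed by (has_feature, is_error).
    cells = {(f, e): 0.5 for f in (True, False) for e in (True, False)}
    for r in results:
        cells[(bool(r[feature]), bool(r["error"]))] += 1
    odds_with = cells[(True, True)] / cells[(True, False)]
    odds_without = cells[(False, True)] / cells[(False, False)]
    return odds_with / odds_without
```

An odds ratio above 1 would indicate that the feature (e.g. a red-flag item) is associated with more model errors.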
A modular pipeline moving from raw public knowledge sources to a validated, structured database of breast cancer facts.
Clinical guidelines, PubMed Central articles, and conference abstracts are ingested, tokenized, and prepared for structured extraction via NER and sentence splitting.
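As a minimal sketch of the sentence-splitting step, the function below uses a naive punctuation rule with protection for abbreviations common in biomedical text; the actual pipeline likely relies on a dedicated NLP library, and the abbreviation list here is only illustrative.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naively split text into sentences on terminal punctuation,
    shielding common biomedical abbreviations from false splits."""
    # Temporarily mask dots inside known abbreviations (illustrative list).
    protected = re.sub(
        r"\b(e\.g|i\.e|et al|Fig|vs)\.",
        lambda m: m.group(0).replace(".", "<DOT>"),
        text,
    )
    parts = re.split(r"(?<=[.!?])\s+", protected)
    return [p.replace("<DOT>", ".").strip() for p in parts if p.strip()]
```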
An ensemble of open-source LLMs proposes factoids from preprocessed content. A fuzzy-logic layer deduplicates and consolidates information from overlapping sources.
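To show how near-duplicate factoids from overlapping sources could be consolidated, here is a greedy sketch using character-level similarity from the standard library; the project's fuzzy-logic layer and its threshold are not specified, so both are assumptions here.

```python
from difflib import SequenceMatcher

def deduplicate_factoids(factoids: list[str], threshold: float = 0.9) -> list[str]:
    """Greedily keep a factoid only if no already-kept factoid is
    near-identical (similarity ratio >= threshold, case-insensitive)."""
    kept: list[str] = []
    for fact in factoids:
        norm = fact.lower().strip()
        if not any(
            SequenceMatcher(None, norm, k.lower().strip()).ratio() >= threshold
            for k in kept
        ):
            kept.append(fact)
    return kept
```

A pairwise greedy pass like this is quadratic in the number of factoids; at corpus scale, blocking or embedding-based clustering would be needed first.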
Each fact is aligned to a common biomedical ontology (e.g., OMOP) and stored as a discrete, domain-tagged knowledge unit in a relational database.
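A knowledge unit stored this way might look like the following SQLite sketch; the table and column names are hypothetical, not the project's actual schema, and the OMOP concept id is left unmapped here.

```python
import sqlite3

# Hypothetical minimal schema for domain-tagged knowledge units.
SCHEMA = """
CREATE TABLE knowledge_unit (
    id         INTEGER PRIMARY KEY,
    statement  TEXT NOT NULL,      -- the discrete factoid
    domain     TEXT NOT NULL,      -- e.g. 'treatment', 'biomarker'
    concept_id INTEGER,            -- mapped OMOP concept id (NULL until mapped)
    source     TEXT NOT NULL,      -- e.g. 'NCCN', 'PubMed Central'
    validated  INTEGER DEFAULT 0   -- set to 1 after expert review
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO knowledge_unit (statement, domain, source) VALUES (?, ?, ?)",
    ("Trastuzumab targets HER2.", "treatment", "NCCN"),
)
```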
Independent experts validate knowledge units via a web interface. Precision and recall are logged per category, driving iterative refinement of the extraction pipeline.
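Per-category precision and recall from expert review could be computed as below, assuming each review record is tagged as a validated extraction ('tp'), a rejected extraction ('fp'), or a fact the expert flagged as missing ('fn'); this record format is an assumption, not the project's logging schema.

```python
from collections import Counter

def per_category_metrics(review_log: list[tuple[str, str]]) -> dict:
    """Compute precision and recall per knowledge domain from expert
    review records of the form (domain, status), where status is one
    of 'tp', 'fp', or 'fn'."""
    counts = Counter(review_log)
    metrics = {}
    for domain in {d for d, _ in review_log}:
        tp = counts[(domain, "tp")]
        fp = counts[(domain, "fp")]
        fn = counts[(domain, "fn")]
        metrics[domain] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics
```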
A tiered benchmark gives an immediate, quantifiable measure of how much we can trust today's LLMs to answer breast cancer questions — and identifies their failure modes.
An interdisciplinary panel of oncologists, pathologists, and patient advocates curates tiered question–answer sets via a Delphi consensus process.
Proprietary and open-source LLMs are evaluated on accuracy, reproducibility, and certainty, accompanied by explainability visualizations via an interactive web portal.
Key publications from the Kather Lab that form the scientific foundation for BELLADONNA.
Interactive tools and resources from the BELLADONNA project. Some modules require authorized access.
Test yourself on 200 expert-authored breast cancer questions and compare your performance against leading LLMs.
A large-scale question bank generated by the agentic pipeline from clinical guidelines, literature, and trial data.
Interactive comparison of LLM accuracy across difficulty tiers, domains, and question types on our breast cancer benchmark.
Navigate breast cancer knowledge as an interactive graph — explore biomarkers, treatments, drug interactions, and clinical evidence.
Evidence-based educational materials written in clear, accessible language, developed with input from patient advocates.
Data sources, extraction methodology, ontology mappings, evaluation protocols, and API reference.