BELLADONNA

Knowledge Atlas …


Explore

All 181,113 documents in the BELLADONNA corpus, projected into 2-D. Drag to pan, scroll to zoom.


Controls:
Drag to pan · Mouse wheel to zoom · Hover for title · Click for details


How the atlas is built

Each dot is one source document. Positions come from a sentence-transformer embedding, so neighbors mean similar content, not just shared vocabulary.
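To illustrate what "neighbors mean similar content" amounts to numerically, here is a minimal sketch of cosine similarity between embedding vectors. The function names and the 2-D toy vectors are hypothetical stand-ins (the real embeddings are 384-dimensional), and this is not the atlas's actual code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity: the dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def nearest(query, corpus):
    """Index of the corpus vector with the highest cosine similarity to query."""
    return max(range(len(corpus)), key=lambda i: cosine_similarity(query, corpus[i]))

# Toy 2-D vectors (illustrative only, not real document embeddings).
corpus = [[0.0, 1.0], [5.0, 0.0], [-1.0, 0.0]]
best = nearest([1.0, 0.1], corpus)
```

Because the measure ignores vector length, [5.0, 0.0] counts as the closest match to [1.0, 0.1] even though the two differ greatly in magnitude. Only direction matters.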

Corpus. 181,113 documents from EPMC, Elsevier, ClinicalTrials.gov, ASCO/ESMO/AGO abstracts, and EMA, comprising 4,653,550 atomic factoids and spanning 1997–2026. The full corpus is drawn in one pass as a direct pixel blit.
Semantic embeddings. BAAI/bge-small-en-v1.5, a 33M-parameter sentence transformer with 384-dim output, run via fastembed (ONNX runtime, no PyTorch). Each document's factoid text is concatenated, cut to roughly the first 600 characters (at most 128 tokens), encoded, and then L2-normalized.
UMAP. UMAP(n_neighbors=15, min_dist=0.15, metric="cosine") — projects the 384-dim embeddings to 2-D while preserving local neighborhoods.
k-means. MiniBatchKMeans(n_clusters=24) on the 384-dim embeddings (not on the 2-D projection). Per-cluster topic labels come from a discriminative-frequency score on the factoid text.
Rendering. Positions are packed into a 1.4 MB binary file and attributes into a 1.1 MB file. Points are drawn by writing pixels directly into an ImageData buffer (no per-point canvas calls), which is what lets all 181,113 points paint in a single frame on a laptop.
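The rendering step can be sketched as follows, under stated assumptions. The file layout is not documented above, so the format used here (little-endian float32 x, y pairs normalized to [0, 1]) is a guess, though 181,113 points at 8 bytes each comes to about 1.45 MB, which is consistent with the stated position file size. The RGBA bytearray stands in for the browser's ImageData buffer, and all function names are hypothetical.

```python
import struct

def pack_positions(points):
    """Pack (x, y) pairs as little-endian float32: 8 bytes per point (assumed layout)."""
    flat = [c for xy in points for c in xy]
    return struct.pack(f"<{len(flat)}f", *flat)

def unpack_positions(blob):
    """Inverse of pack_positions: bytes back to a list of (x, y) tuples."""
    flat = struct.unpack(f"<{len(blob) // 4}f", blob)
    return list(zip(flat[0::2], flat[1::2]))

def blit(points, width, height, rgba=(255, 255, 255, 255)):
    """Write one pixel per point into a flat row-major RGBA buffer.

    Mirrors the layout of an ImageData buffer (4 bytes per pixel).
    Points outside the assumed [0, 1] range are skipped.
    """
    buf = bytearray(width * height * 4)
    color = bytes(rgba)
    for x, y in points:
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            continue
        px = min(int(x * width), width - 1)    # clamp x == 1.0 to the last column
        py = min(int(y * height), height - 1)  # clamp y == 1.0 to the last row
        i = (py * width + px) * 4
        buf[i:i + 4] = color
    return buf

# Toy data (illustrative only).
points = [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0), (2.0, 0.5)]
blob = pack_positions(points)
frame = blit(unpack_positions(blob), width=4, height=4)
```

The design point is that the loop touches only one contiguous byte buffer, so the cost per point is a few arithmetic operations and a 4-byte write. In the browser the finished buffer would then be handed to the canvas in a single call rather than issuing one draw call per point.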

What this is, and what it isn't. A semantic atlas: proximity reflects what a small general-purpose sentence transformer thinks two documents are about. It is not a clinical ontology. Treat topic labels as regional hints, not strict categories, and don't assume that proximity implies a causal or therapeutic relationship.