BELLADONNA

Knowledge Atlas

How the atlas is built

Each dot is one source document. Positions come from a sentence-transformer embedding, so neighbors mean similar content, not just shared vocabulary.

  1. Corpus. 181,113 documents from EPMC, Elsevier, ClinicalTrials.gov, ASCO/ESMO/AGO abstracts, and EMA, containing 4,653,550 atomic factoids and spanning 1997–2026. The full corpus is drawn at once (see step 5).
  2. Semantic embeddings. BAAI/bge-small-en-v1.5 — 33M-param sentence transformer, 384-dim output, run via fastembed (ONNX runtime, no PyTorch). Each document's concatenated factoid text (first ~600 chars, truncated to 128 tokens) is encoded, then L2-normalized.
  3. UMAP. UMAP(n_neighbors=15, min_dist=0.15, metric="cosine") — projects the 384-dim embeddings to 2-D while preserving local neighborhoods.
  4. k-means. MiniBatchKMeans(n_clusters=24) on the 384-dim embeddings (not on the 2-D projection), so cluster membership does not depend on how UMAP lays the points out. Per-cluster topic labels come from a discriminative-frequency score on the factoid text.
  5. Rendering. Positions are packed into a 1.4 MB binary file and attributes into a 1.1 MB file. Points are drawn by writing pixels directly into an ImageData buffer (no per-point canvas calls), which is what lets all 181,113 points paint in a single frame on a laptop.
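
The packing and pixel-writing in step 5 can be sketched as follows. This is a minimal illustration in Python, not the atlas's actual browser code. It assumes positions are stored as interleaved little-endian float32 (x, y) pairs, which is consistent with the file size (181,113 × 2 × 4 bytes = 1,448,904 bytes, about 1.4 MB) but is not stated by the source. The function names pack_positions, unpack_positions, and rasterize are hypothetical. The flat RGBA byte layout mirrors what an ImageData buffer holds in the browser: four bytes per pixel, row by row.

```python
import struct


def pack_positions(points):
    """Pack (x, y) pairs as interleaved little-endian float32 (assumed layout)."""
    flat = [coord for xy in points for coord in xy]
    return struct.pack(f"<{len(flat)}f", *flat)


def unpack_positions(blob):
    """Inverse of pack_positions: bytes back to a list of (x, y) tuples."""
    flat = struct.unpack(f"<{len(blob) // 4}f", blob)
    return list(zip(flat[0::2], flat[1::2]))


def rasterize(points, width, height, rgba=(255, 255, 255, 255)):
    """Write one pixel per point into a flat RGBA buffer (ImageData-like).

    Coordinates are min-max scaled to the pixel grid. No y-flip is applied,
    so the smallest y lands on row 0.
    """
    buf = bytearray(width * height * 4)  # all zeros: transparent black
    if not points:
        return buf
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid division by zero
    span_y = (max(ys) - min_y) or 1.0
    color = bytes(rgba)
    for x, y in points:
        px = int((x - min_x) / span_x * (width - 1))
        py = int((y - min_y) / span_y * (height - 1))
        i = (py * width + px) * 4  # byte offset of pixel (px, py)
        buf[i:i + 4] = color       # one slice write, no per-point draw call
    return buf


# Toy usage: 3 points * 2 coords * 4 bytes = 24 bytes on disk.
points = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.25)]
blob = pack_positions(points)
frame = rasterize(unpack_positions(blob), width=4, height=4)
```

The design point is that the inner loop only computes an index and writes four bytes, and the finished buffer is handed to the canvas once per frame instead of issuing a draw call per point.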

What this is and isn't. A semantic atlas: proximity reflects what a small general-purpose sentence transformer thinks two documents are about. It is not a clinical ontology: treat topic labels as regional hints, not strict categories, and don't assume proximity implies a causal or therapeutic relationship.
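
To see why topic labels are only hints, it helps to look at what a discriminative-frequency score does. The exact formula used for the atlas is not given here, so the sketch below is one plausible form and an assumption throughout: a smoothed ratio of a term's relative frequency inside a cluster to its relative frequency outside it. The function name discriminative_terms, the whitespace tokenizer, and the toy strings are all hypothetical.

```python
from collections import Counter


def discriminative_terms(cluster_docs, other_docs, top_n=3, smoothing=1.0):
    """Rank terms by smoothed in-cluster / out-of-cluster relative frequency.

    An assumed scoring rule for illustration, not the atlas's actual one.
    """
    inside = Counter(t for doc in cluster_docs for t in doc.lower().split())
    outside = Counter(t for doc in other_docs for t in doc.lower().split())
    n_in = sum(inside.values())
    n_out = sum(outside.values())
    vocab = len(set(inside) | set(outside))

    scores = {}
    for term, count in inside.items():
        # Add-one style smoothing keeps unseen outside terms from dividing by 0.
        p_in = (count + smoothing) / (n_in + smoothing * vocab)
        p_out = (outside[term] + smoothing) / (n_out + smoothing * vocab)
        scores[term] = p_in / p_out

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Toy token strings, not real factoids.
cluster = ["her2 trastuzumab trial", "her2 trastuzumab response"]
others = ["pd1 pembrolizumab trial", "pd1 nivolumab response"]
labels = discriminative_terms(cluster, others, top_n=2)
```

In this toy case the terms shared with other clusters ("trial", "response") score 1.0 and drop out, leaving only the terms that are over-represented in the cluster. A label built this way describes what is frequent in a region of the map, which is not the same as a curated category.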