About BioCosm

An independent biotech intelligence platform.

What BioCosm Is

The data behind drug and diagnostic development is almost all public and almost unusable. It sits scattered across a dozen databases that each describe the same drug differently and none of which quite agree. BioCosm reconciles them into a single, continuously corrected map of the clinical-stage landscape: a curated, deduplicated set of 3,700+ drug programs, diagnostics, and pipeline compounds, organized by molecular target and clinical phase. It is a deliberately bounded set, not a census of everything in development, and within that scope it makes the shape of the pipeline visible at a glance: where money and science are concentrating, which targets are crowded, and which programs are approaching the readouts that will make or break them.

It is not just a directory. Each program in it is reconciled into a single object, written up as a cited analysis, and (for pipeline compounds) scored for its probability of FDA approval. The whole dataset then rechecks and repairs itself every night, and the same reconciled data is queryable directly by AI assistants through a Model Context Protocol (MCP) server, not just browsable by people.

Coverage spans oncology, immunology, CNS, cardiovascular, rare disease, and metabolic disorders, though depth varies by area. Each node links to a structured analytical writeup with sourced citations and, for pipeline compounds, a probability-of-success estimate anchored to published phase-transition base rates.

It is free, and it is independent. A reconciled, analyzed map of the clinical pipeline like this is normally a paid product that runs into thousands of dollars a year per seat; BioCosm is open to anyone, no login and no paywall, and is not affiliated with any pharmaceutical company, investment firm, or academic institution. For a candid account of what is included, what is partial, and what is left out on purpose, see the Coverage page; to see the model's live calls, the Predictions page.

How It Works

BioCosm pulls from nine public databases (listed below), each built for its own purpose and each holding a piece of the picture the others lack: one knows a drug's chemistry, another its trials, another its FDA status, another the company financials behind it. No single source has the whole story, so the complete object only exists once they are joined. Collecting the data is the easy part; it is all public and free. The hard part is that no two sources agree on what a drug even is: the same compound is a code name in one database, a brand name in another, and unrelated IDs everywhere else.

So before a single dot lands on the map, the system has to recognize that several records across several databases are the same drug, then assemble each one into a single object, linking a drug to its target, its trials, its approval, and its owner. That object exists in none of the sources. This reconciliation, called entity resolution, is a genuinely hard data-engineering problem, the sort that companies build entire teams and commercial platforms around. Here it runs automatically every night across thousands of programs, and it is the unglamorous core of the project.
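One common way to approach this kind of reconciliation is to treat every name or identifier as an alias and merge any records that share one. The sketch below does that with a union-find structure. It is an illustration only, not BioCosm's actual pipeline: the record layout, the source labels, and the drug names (EX-101, examplinib, BRANDEX) are all invented for the example.

```python
from collections import defaultdict


def resolve_entities(records):
    """Group records that share at least one normalized alias.

    Each record is a dict with a "source" label and a list of "aliases"
    (a hypothetical layout chosen for this sketch).
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # normalized alias -> index of first record using it
    for idx, record in enumerate(records):
        for alias in record["aliases"]:
            key = alias.strip().lower()
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, record in enumerate(records):
        groups[find(idx)].append(record["source"])
    return sorted(sorted(sources) for sources in groups.values())


# Made-up records: three describe one drug under different names.
sample_records = [
    {"source": "trials", "aliases": ["EX-101"]},
    {"source": "chem", "aliases": ["EX-101", "examplinib"]},
    {"source": "fda", "aliases": ["Examplinib", "BRANDEX"]},
    {"source": "trials", "aliases": ["other-202"]},
]
print(resolve_entities(sample_records))
```

Exact alias matching like this misses misspellings, salt forms, and reused code names, which is part of why the real problem is hard and why stable cross-reference identifiers matter.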

The map is the skeleton; the writeups are the substance. Every node gets one, and each is a genuine analysis rather than a data dump: what the drug, diagnostic, or company is, how it works, what its trials actually show, and where the real risks sit, written in plain language for an intelligent reader. The voice is a sharp colleague who has read both the papers and the filings and tells you what matters and why. For pipeline drugs the writeup also shows the math behind the probability-of-success estimate, factor by factor.

Each writeup is grounded in real citations: PubMed PMIDs, ClinicalTrials.gov NCT IDs, FDA application numbers, and SEC filing references. Every reference is checked against its live source before the writeup publishes, and any that fail hold it back. Sources can still move or go stale, but that is exactly the point of a citation: the primary source is one click away.
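A publish gate of this kind can be sketched in a few lines. The version below is an assumption about the general shape, not BioCosm's code: it checks that each identifier is well formed, then asks a caller-supplied lookup function whether the identifier resolves at its source. The function names, the `exists_at_source` callback, and the placeholder IDs are hypothetical; the PMID pattern in particular is a loose guess.

```python
import re

# Identifier shapes. "NCT" plus eight digits follows the ClinicalTrials.gov
# convention; the PMID pattern (any positive integer) is a loose assumption.
FORMATS = {
    "nct": re.compile(r"NCT\d{8}"),
    "pmid": re.compile(r"[1-9]\d*"),
}


def failing_citations(citations, exists_at_source):
    """Return every (kind, identifier) pair that fails either check.

    exists_at_source is a hypothetical callable that would query the live
    source; it is injected so the gate can be exercised without a network.
    """
    failures = []
    for kind, identifier in citations:
        pattern = FORMATS.get(kind)
        well_formed = pattern is not None and pattern.fullmatch(identifier) is not None
        if not well_formed or not exists_at_source(kind, identifier):
            failures.append((kind, identifier))
    return failures


def can_publish(citations, exists_at_source):
    """A writeup is held back if even one citation fails."""
    return not failing_citations(citations, exists_at_source)


# Stand-in for the live lookup, using placeholder identifiers.
known = {("nct", "NCT00000000"), ("pmid", "12345")}


def stub_lookup(kind, identifier):
    return (kind, identifier) in known


print(can_publish([("nct", "NCT00000000"), ("pmid", "12345")], stub_lookup))
print(failing_citations([("nct", "NCT123"), ("pmid", "99999")], stub_lookup))
```

Unknown citation kinds fail by default here, which is the conservative choice for a gate whose job is to hold back anything unverified.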

Probability-of-success estimates start from the phase-transition base rates in Wong, Siah & Lo (2019) and adjust them with eight program-specific factors. The full methodology is documented openly on the Methodology page.
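To make "base rate adjusted by factors" concrete, here is one conventional way such an adjustment can be done: convert the base rate to log-odds, add a shift per factor, and convert back. This is an assumed form for illustration only. The actual factors, weights, and combination rule are the ones on the Methodology page, and the numbers below are placeholders, not values from Wong, Siah & Lo.

```python
import math


def adjusted_probability(base_rate, shifts):
    """Apply additive log-odds shifts to a base rate, one factor at a time.

    shifts is a list of (factor_name, log_odds_shift) pairs. Returns the
    final probability and the running value after each factor, so the
    arithmetic can be shown step by step.
    """
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must be strictly between 0 and 1")
    log_odds = math.log(base_rate / (1.0 - base_rate))
    steps = [("base rate", base_rate)]
    for name, shift in shifts:
        log_odds += shift
        steps.append((name, 1.0 / (1.0 + math.exp(-log_odds))))
    return steps[-1][1], steps


# Placeholder inputs: a 50% base rate and two hypothetical factors.
probability, steps = adjusted_probability(
    0.5, [("hypothetical_factor_a", 0.4), ("hypothetical_factor_b", -0.4)]
)
for name, value in steps:
    print(f"{name}: {value:.3f}")
```

Working in log-odds keeps the result between 0 and 1 no matter how many factors push in the same direction, which a simple multiplication of probabilities does not guarantee.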

Keeping the Data Honest

Public biotech data is not just messy; it goes stale. Drugs advance phases, trials fail, approvals land, and a fact that was correct in January is wrong by June. So data quality here is not a one-time scrub. It is a continuous loop that runs every night.

An automated audit re-checks fields across the database against the live source APIs, looking for three kinds of problem: values that are simply wrong (an approved drug mislabeled as Phase 2), values that have gone out of date, and prose that drifts from the structured facts. When it finds something, the correction flows back into the database and re-triggers the writeups and scores that depended on it. Each pass tends to find less than the last, so the dataset converges toward correctness rather than quietly decaying.
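The core of such a loop, comparing stored fields with live values and working out what has to be regenerated, can be sketched as follows. This is a simplified assumption about the shape of the process: the field names, the dependency map, and the `audit` function are hypothetical, and checking prose against structured facts is left out.

```python
def audit(stored, live, dependents):
    """Compare stored field values with freshly fetched ones.

    stored and live map field names to values; dependents maps a field
    name to the downstream artifacts built from it. Returns the
    corrections to apply and the set of artifacts to regenerate.
    """
    corrections = {}
    to_rebuild = set()
    for field, live_value in live.items():
        if stored.get(field) != live_value:
            corrections[field] = live_value
            to_rebuild.update(dependents.get(field, ()))
    return corrections, to_rebuild


# Illustrative record: the stored phase lags behind the live source.
stored_record = {"phase": "Phase 2", "status": "Recruiting"}
live_record = {"phase": "Approved", "status": "Recruiting"}
depends_on = {"phase": ["writeup", "score"], "status": ["writeup"]}

corrections, to_rebuild = audit(stored_record, live_record, depends_on)
print(corrections)
print(sorted(to_rebuild))
```

Only artifacts that depend on a changed field are rebuilt, so a night with few discrepancies does proportionally little work.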

No automated system is perfect, and a fact can still slip through wrong or stale. That is exactly why everything is sourced: BioCosm is built to get you oriented fast and to show its work, so when a decision actually rides on a number, the primary document is already one click away. Use it the way you would a sharp analyst's brief, and confirm the load-bearing facts at the source.

Probability scores are deliberately called estimates, not predictions, and the model is only ever fed facts that were knowable before the outcome, so a number can be checked against reality later instead of flattering itself in hindsight. Coverage is stated plainly too: some products are excluded on purpose (hardware devices, supplements), some areas have thinner pipeline depth than others, and the program count reflects distinct drugs after deduplication, not raw trial rows.
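The "knowable before the outcome" rule amounts to a point-in-time filter on the model's inputs. A minimal sketch, assuming each fact carries the date it became public (the `known_on` field and the sample facts are hypothetical):

```python
from datetime import date


def facts_known_by(facts, cutoff):
    """Keep only facts that were public on or before the cutoff date."""
    return [fact for fact in facts if fact["known_on"] <= cutoff]


# Made-up facts: the second only became public after the cutoff.
sample_facts = [
    {"name": "phase_2_result", "known_on": date(2023, 3, 1)},
    {"name": "approval_decision", "known_on": date(2024, 9, 15)},
]
usable = facts_known_by(sample_facts, cutoff=date(2024, 1, 1))
print([fact["name"] for fact in usable])
```

Scoring only from the filtered list is what lets an estimate be compared with the eventual outcome without hindsight leaking in.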

Data Sources

Each of these sources organizes reality differently:

ClinicalTrials.gov: trial registrations, phase, status, endpoints, and sponsors
ChEMBL: compound structures, mechanism of action, and target associations
UniProt: the protein-target definitions the entire map is organized around
PubChem / UniChem: chemical identity and cross-database compound matching
openFDA / Drugs@FDA: approval status, labeling, and adverse-event data
SEC EDGAR: revenue and pipeline disclosures from 10-K, 10-Q, and 8-K filings
Open Targets: target-to-disease associations
Yahoo Finance: public-market valuation
RxNorm: normalizing the many names a single drug carries

All source data is publicly available; BioCosm aggregates and analyzes it and does not reproduce protected content. Freshness varies by source.

Get in touch

Found an error, want to use BioCosm, or just want to talk? Send a note. Reader corrections genuinely make the data better.

Not financial advice. BioCosm is an intelligence tool that organizes public data. Probability scores reflect structured estimates of clinical trial success, not investment recommendations. See Terms for the full disclaimer.