About BioCosm

An independent biotech intelligence platform.

What BioCosm Is

The data behind drug development is almost all public and almost unusable. It sits scattered across a dozen databases that each describe the same drug differently and none of which quite agree. BioCosm reconciles them into a single, continuously corrected map of the clinical-stage landscape: every drug program, diagnostic, and pipeline compound, organized by molecular target and clinical phase. It makes the shape of the pipeline visible at a glance, showing where money and science are concentrating, which targets are crowded, and which programs are approaching the readouts that will make or break them.

It is not just a directory. Every program is reconciled into a single object, written up as a cited analysis, and (for pipeline compounds) scored for its probability of FDA approval. The whole dataset then rechecks and repairs itself every night, and the same reconciled data is queryable directly by AI assistants, not just browsable by people.

The database covers 3,700+ drugs, targets, and diagnostics across oncology, immunology, CNS, cardiovascular, rare disease, and metabolic disorders. Each node links to a structured analytical writeup with citations checked against their live sources and, for pipeline compounds, a probability-of-success estimate anchored to published phase-transition base rates.

BioCosm is an independent project, not affiliated with any pharmaceutical company, investment firm, or academic institution.

For an honest account of exactly what is included, what is only partial, and what is left out on purpose, see the Coverage page.

How It Works

BioCosm pulls from about ten public databases (listed below), each built for its own purpose with its own way of organizing the world. The hard part is not collecting the data, which is all public and free. It is that no two sources agree on what a drug even is: the same compound is a code name in one database, a brand name in another, and a string of unrelated IDs everywhere else.

So before a single dot lands on the map, the system has to recognize that several records across several databases are the same drug, then assemble each one into a single object, linking a drug to its target, its trials, its approval, and its owner. That object exists in none of the sources. This reconciliation, called entity resolution, runs every night across thousands of programs and is the unglamorous core of the project.
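
As a rough illustration of the reconciliation step described above, the sketch below groups records that share any identifier, using a union-find structure. The record fields, the sample identifiers, and the function name are all hypothetical, and this is only one common way to approach entity resolution, not a description of BioCosm's actual matching logic.

```python
from collections import defaultdict


def resolve_entities(records):
    """Group records that share at least one identifier.

    Each record is a dict with a "source" name and a set of "ids"
    (hypothetical fields, chosen for this sketch).
    Returns a list of groups, each a list of record indexes.
    """
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as the root so results are stable.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # normalized identifier -> first record carrying it
    for idx, record in enumerate(records):
        for ident in record["ids"]:
            key = ident.strip().lower()
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Made-up records: a code name, a generic name, and a brand name
# that all refer to one drug, plus one unrelated program.
SAMPLE = [
    {"source": "trials", "ids": {"XYZ-101"}},
    {"source": "compounds", "ids": {"xyz-101", "examplinib"}},
    {"source": "approvals", "ids": {"Examplinib", "BRANDEX"}},
    {"source": "trials", "ids": {"QRS-202"}},
]

print(resolve_entities(SAMPLE))  # [[0, 1, 2], [3]]
```

Exact matching on normalized identifiers is the simplest possible rule. Real sources would likely need fuzzier matching and a way to break apart false merges, which this sketch does not attempt.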

Each writeup is grounded in real citations: PubMed PMIDs, ClinicalTrials.gov NCT IDs, FDA application numbers, and SEC filing references. Before a writeup goes live, its citations are checked against the live sources, and any that fail hold the writeup back rather than shipping with it. The check is not infallible (sources move, and a reference can still go stale after it passes), so a citation is best treated as a fast path to the primary source, not a guarantee.
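
One way to picture the gate: a writeup is released only when every citation passes. In the sketch below, the format patterns are a cheap first filter (an NCT ID is "NCT" followed by eight digits, a PMID is a positive integer), and `lookup` is a hypothetical stand-in for the call to the live source, which is not specified here. Only two citation kinds are covered, and the function names are illustrative.

```python
import re

# Shape checks applied before any live lookup.
PATTERNS = {
    "nct": re.compile(r"NCT\d{8}"),
    "pmid": re.compile(r"[1-9]\d*"),
}


def failed_citations(citations, lookup):
    """Return the citations that should hold a writeup back.

    citations: iterable of (kind, identifier) pairs.
    lookup: callable(kind, identifier) -> bool, a hypothetical
            stand-in for a check against the live source.
    """
    failures = []
    for kind, ident in citations:
        pattern = PATTERNS.get(kind)
        well_formed = pattern is not None and pattern.fullmatch(ident) is not None
        # Skip the lookup entirely for malformed or unknown citations.
        if not well_formed or not lookup(kind, ident):
            failures.append((kind, ident))
    return failures


def can_publish(citations, lookup):
    # The gate: a single failed citation holds the whole writeup back.
    return not failed_citations(citations, lookup)
```

Returning the list of failures, rather than a bare yes or no, leaves room to record why a writeup was held.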

Probability-of-success estimates use phase-transition base rates from Wong, Siah & Lo (2019), adjusted by eight program-specific factors. The full methodology is documented openly on the Methodology page.
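
For readers who want a feel for the arithmetic, here is a minimal sketch of how a base rate might be chained and adjusted. Everything in it is an assumption: the transition probabilities are placeholders and are NOT the figures from Wong, Siah & Lo (2019), the adjustment in log-odds space is one plausible scheme rather than the documented method, and the factor values are invented. The Methodology page is the authority on what BioCosm actually does.

```python
import math

# PLACEHOLDER transition probabilities for illustration only.
# These are not the published base rates.
PLACEHOLDER_BASE_RATES = {
    "phase1_to_phase2": 0.60,
    "phase2_to_phase3": 0.40,
    "phase3_to_approval": 0.50,
}
PHASE_ORDER = ["phase1_to_phase2", "phase2_to_phase3", "phase3_to_approval"]


def base_probability(current_phase_index, rates=PLACEHOLDER_BASE_RATES):
    """Multiply the transition rates that remain between now and approval."""
    p = 1.0
    for step in PHASE_ORDER[current_phase_index:]:
        p *= rates[step]
    return p


def adjust(p, log_odds_shifts):
    """Shift a probability in log-odds space (an assumed scheme).

    Working in log-odds keeps the result inside the 0 to 1 range
    no matter how many factors are applied.
    """
    logit = math.log(p / (1.0 - p)) + sum(log_odds_shifts)
    return 1.0 / (1.0 + math.exp(-logit))


# A Phase 1 program under the placeholder rates, with two invented factors.
base = base_probability(0)
estimate = adjust(base, [0.3, -0.1])
```

Under this scheme a positive shift raises the estimate and a negative one lowers it, and shifts that cancel leave the base rate unchanged.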

Keeping the Data Honest

Public biotech data is not just messy; it goes stale. Drugs advance phases, trials fail, approvals land, and a fact that was correct in January is wrong by June. So data quality here is not a one-time scrub. It is a continuous loop that runs every night.

An automated audit re-checks fields across the database against the live source APIs, looking for three kinds of problem: values that are simply wrong (an approved drug mislabeled as Phase 2), values that have gone out of date, and prose that drifts from the structured facts. When it finds something, the correction flows back into the database and re-triggers the writeups and scores that depended on it. Each pass tends to find less than the last, so the dataset converges toward correctness rather than quietly decaying.
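
The loop described above can be sketched as a single pass that compares stored fields with what the source reports now, writes corrections back, and collects the derived items to rebuild. The data shapes and names are hypothetical, and the sketch covers only structured fields: detecting prose that drifts from the facts is a separate problem it does not address.

```python
def audit_pass(stored, live, dependents):
    """Run one audit pass over structured fields.

    stored: dict mapping (program_id, field) -> value in the database.
    live: dict mapping (program_id, field) -> value the source reports now.
    dependents: dict mapping field -> derived artifacts to rebuild.
    All three shapes are hypothetical, chosen for this sketch.

    Returns (corrections, rebuild): corrections maps each changed key
    to its (old, new) pair, and rebuild is the set of
    (program_id, artifact) pairs to regenerate.
    """
    corrections = {}
    rebuild = set()
    for key, current in live.items():
        old = stored.get(key)
        if old != current:
            corrections[key] = (old, current)
            stored[key] = current  # the correction flows back
            program_id, field = key
            for artifact in dependents.get(field, []):
                rebuild.add((program_id, artifact))
    return corrections, rebuild


# Made-up example: an approved drug still labeled Phase 2.
stored = {("p1", "phase"): "Phase 2", ("p1", "sponsor"): "Acme"}
live = {("p1", "phase"): "Approved", ("p1", "sponsor"): "Acme"}
dependents = {"phase": ["writeup", "score"]}

first = audit_pass(stored, live, dependents)
second = audit_pass(stored, live, dependents)  # nothing left to fix
```

Running the pass twice against unchanged sources finds nothing the second time, which is the convergence behavior the paragraph describes.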

Because BioCosm is automated and AI-generated, any individual fact can be incomplete, stale, or outright wrong despite these safeguards. Treat it as a research starting point, not ground truth, and verify anything that matters against primary sources.

Citations are not decoration. PubMed IDs, trial numbers, and filing references are checked against their live sources before a writeup ships, and ones that fail the check hold it back rather than get dressed up as fact. The check is a gate, not an afterthought, though no automated check catches everything. Probability scores are deliberately called estimates, not predictions, and the model is only ever fed facts that were knowable before the outcome, so a number can be checked against reality later instead of flattering itself in hindsight.
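
The "knowable before the outcome" rule amounts to a point-in-time filter on the model's inputs. A minimal sketch, assuming each fact carries a publication date (the `published` field and the function name are hypothetical):

```python
from datetime import date


def knowable_before(facts, cutoff):
    """Keep only facts published strictly before the cutoff date."""
    return [fact for fact in facts if fact["published"] < cutoff]


# Made-up facts around a readout on 2024-06-01.
facts = [
    {"claim": "Phase 2 enrollment complete", "published": date(2024, 3, 1)},
    {"claim": "Phase 2 met primary endpoint", "published": date(2024, 6, 1)},
]
inputs = knowable_before(facts, date(2024, 6, 1))
```

The strict comparison matters: a fact dated on the outcome day may be the outcome itself, so it is excluded.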

Coverage is stated plainly rather than overclaimed. Some products are excluded on purpose (hardware devices, supplements), some areas have thinner pipeline depth than others, and the program count reflects distinct drugs after deduplication, not raw trial rows. The full scope, and its limits, are documented on the Coverage page.

Data Sources

BioCosm reconciles about ten public databases, each organizing reality differently:

  • ClinicalTrials.gov: trial registrations, phase, status, endpoints, and sponsors
  • ChEMBL: compound structures, mechanism of action, and target associations
  • UniProt: the protein-target definitions the entire map is organized around
  • PubChem / UniChem: chemical identity and cross-database compound matching
  • openFDA / Drugs@FDA: approval status, labeling, and adverse-event data
  • SEC EDGAR: revenue and pipeline disclosures from 10-K, 10-Q, and 8-K filings
  • Open Targets: target-to-disease associations
  • Yahoo Finance: public-market valuation
  • RxNorm: normalizing the many names a single drug carries

All source data is publicly available. BioCosm aggregates and analyzes it; it does not reproduce protected content. Data freshness varies, so always verify against primary sources before acting on any data point.

Get in Touch

Found an error, want to use BioCosm, or just want to talk? Send a note. Reader corrections genuinely make the data better.

Not financial advice. BioCosm is an intelligence tool that organizes public data. Probability scores reflect structured estimates of clinical trial success, not investment recommendations. See the Terms for the full disclaimer.