Methodology

How BioCosm estimates probability of FDA approval. Every adjustment has a stated reason. Nothing is a black box.

The Core Question

For each drug program in the pipeline, BioCosm answers one question: given what is publicly known, what is the probability this drug gets FDA approval from its current phase?

This is not a target quality score. It is not financial advice. It is a structured estimate of clinical trial success probability, anchored to historical base rates and adjusted for program-specific factors. The full reasoning lives in each drug’s writeup.

Step 1: Phase-Transition Base Rates (Wong et al.)

Every prediction starts from empirical phase-transition success rates published by Wong, Siah, and Lo (2019) and refined in subsequent literature. These are not guesses - they are measured success rates across tens of thousands of drug programs over decades, stratified by phase and therapeutic area.

Crucially, each published rate covers a single phase transition - the probability of advancing from one stage to the next (Phase 2 → Phase 3, Phase 3 → regulatory filing, filing → FDA approval), not the probability of approval outright. To answer our question - the cumulative probability of FDA approval from a drug’s current phase - we compose the remaining transitions by multiplying them together. A drug in Phase 2 must clear Phase 3, then file, then win approval, so its base likelihood of approval (LOA) is:

Cumulative likelihood of approval (LOA), composed from Wong transition rates
Phase 3 drug: P(P3→filing) × P(filing→approval)
Phase 2 drug: P(P2→P3) × P(P3→filing) × P(filing→approval)
Phase 1 drug: P(P1→P2) × P(P2→P3) × P(P3→filing) × P(filing→approval)
Example - Phase 2 oncology: 0.28 × 0.57 × 0.84 ≈ 13%

Composing the transitions this way reproduces the well-known shape of historical drug-development odds: a Phase 1 oncology asset sits in the single digits to low double digits, a Phase 2 asset around 10-15%, and a Phase 3 asset around 45-55% - each stage carrying the survivorship of having already cleared the ones before it. Earlier-phase drugs correctly show lower cumulative approval odds than late-phase drugs, because they have more hurdles left to clear.
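The composition can be sketched in a few lines of Python. This is a minimal illustration, not BioCosm's actual code: the function and table names are hypothetical, and the only rates used are the three oncology figures from the worked example above. The Phase 1 → Phase 2 rate is not given in the text, so it is left out rather than guessed.

```python
from math import prod

# Oncology transition rates taken from the worked example in the text.
# The Phase 1 -> Phase 2 rate is not stated there, so it is omitted here;
# supply your own table to score a Phase 1 drug.
ONCOLOGY_TRANSITIONS = {
    "P2->P3": 0.28,
    "P3->filing": 0.57,
    "filing->approval": 0.84,
}

# Hurdles in order, from earliest to latest.
HURDLES = ["P1->P2", "P2->P3", "P3->filing", "filing->approval"]

# Index of the first hurdle a drug still has to clear, by current phase.
FIRST_HURDLE = {"Phase 1": 0, "Phase 2": 1, "Phase 3": 2}


def cumulative_loa(current_phase, transitions):
    """Multiply every remaining transition rate from current_phase to approval."""
    remaining = HURDLES[FIRST_HURDLE[current_phase]:]
    missing = [h for h in remaining if h not in transitions]
    if missing:
        raise KeyError(f"no rate supplied for: {missing}")
    return prod(transitions[h] for h in remaining)


phase2_loa = cumulative_loa("Phase 2", ONCOLOGY_TRANSITIONS)
phase3_loa = cumulative_loa("Phase 3", ONCOLOGY_TRANSITIONS)
print(f"Phase 2 oncology LOA: {phase2_loa:.1%}")  # 13.4%
print(f"Phase 3 oncology LOA: {phase3_loa:.1%}")  # 47.9%
```

Because every rate is below 1, each extra hurdle can only lower the product, which is why the earlier-phase number is always the smaller one.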

This composed LOA is the base rate the trained model below adjusts. It is the dominant term - the adjustments move the needle, they don’t override the empirical floor.

How an estimate changes as a drug advances. Because the number is the cumulative chance of approval from the drug’s current phase, it is not fixed for life. When a drug moves up a phase, it clears one of the hurdles above, so we re-score it and log a new, higher, dated estimate; the earlier estimate is kept in the drug’s history, never erased. So a single drug can carry a trail like 12% (Phase 1) → 28% (Phase 2) → 61% (Phase 3), each stamped with the date it was made. Advancing a phase is not a “win” we score - the drug can still fail later.
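One way to picture this history is as an append-only log. The sketch below is an assumption about how such a record could be structured, not a description of BioCosm's storage; the class names, the drug name, and the dates are all hypothetical. Only the three percentages come from the example trail above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Estimate:
    made_on: date        # date the estimate was logged
    phase: str           # phase the drug was in at that time
    probability: float   # cumulative probability of approval


@dataclass
class DrugHistory:
    name: str
    estimates: list = field(default_factory=list)

    def log(self, estimate):
        """Append a new estimate; earlier entries are never edited or removed."""
        if self.estimates and estimate.made_on < self.estimates[-1].made_on:
            raise ValueError("estimates must be logged in date order")
        self.estimates.append(estimate)

    def current(self):
        """Return the most recent estimate."""
        return self.estimates[-1]

    def trail(self):
        """Render the full history, oldest first."""
        return " → ".join(f"{e.probability:.0%} ({e.phase})" for e in self.estimates)


# Hypothetical drug and dates, using the example trail from the text.
history = DrugHistory("example-drug")
history.log(Estimate(date(2022, 3, 1), "Phase 1", 0.12))
history.log(Estimate(date(2023, 9, 15), "Phase 2", 0.28))
history.log(Estimate(date(2025, 6, 30), "Phase 3", 0.61))
print(history.trail())  # 12% (Phase 1) → 28% (Phase 2) → 61% (Phase 3)
```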

What we grade ourselves on. Only the final outcome - approved, or killed (a rejection, withdrawal, or failed pivotal trial) - and only against an estimate we made before that decision (an automatic leakage guard enforces this). We do not currently grade phase-to-phase advancement; that is a different question - the chance of clearing the next single hurdle, not eventual approval - and would need its own model. It is a candidate for a separate, faster-feedback scorecard later. The live results are on the track record.
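The leakage guard amounts to a date comparison. The sketch below shows the idea under one stated assumption: that the estimate graded is the latest one logged strictly before the decision date. The text does not say which pre-decision estimate is used, and the function name and dates are hypothetical.

```python
from datetime import date


def gradable_estimate(estimates, decision_date):
    """Return the latest (made_on, probability) pair dated strictly before
    decision_date, or None if every estimate was made on or after it."""
    eligible = [e for e in estimates if e[0] < decision_date]
    if not eligible:
        return None
    return max(eligible, key=lambda e: e[0])


# Hypothetical history: the last entry was logged after the decision,
# so it must not be graded.
estimates = [
    (date(2023, 9, 15), 0.28),
    (date(2025, 6, 30), 0.61),
    (date(2026, 2, 1), 0.95),
]
decision_date = date(2026, 1, 10)

graded = gradable_estimate(estimates, decision_date)
print(graded)  # (datetime.date(2025, 6, 30), 0.61)
```

An estimate stamped on the decision date itself is excluded here, on the conservative assumption that it may already reflect the outcome.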

Step 2: A Model Trained on Ten Facts

The base rate gives every drug in a given disease area and stage the same number. It cannot tell two of them apart. To do that - to say this Phase 2 drug looks more promising than that one - we use a model that adjusts the base rate up or down based on ten specific, public facts about each drug.

The important part is how the weights were chosen. We did not hand-pick how much each fact should matter. We took roughly 4,500 real drug programs whose fate is already settled - approved or failed - and let a standard statistical model (a logistic regression) learn the weights from the historical record itself. The model reads the past and works out which facts actually separated the winners from the losers.

The ten facts

All ten are knowable when a trial is registered, so the model can score a drug that is still in progress. Roughly grouped:

  • The drug’s own history: whether it already cleared an earlier phase, and how often its sponsor has succeeded before. These are the two strongest signals.
  • Who is being studied: whether the trial enrolls only patients with a specific biomarker.
  • Trial scale: how many patients are enrolled and how many arms the trial has.
  • Trial design: whether it is randomized, how much it is blinded, whether it has a comparator arm, and how many primary and secondary goals it sets out to measure.
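As a data structure, the ten facts fit in a single flat record. The field names and encodings below are illustrative assumptions (for example, blinding as an integer from 0 for open-label upward); the text names the facts but not how each one is encoded.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class TrialFacts:
    # The drug's own history
    cleared_earlier_phase: bool
    sponsor_success_rate: float   # share of the sponsor's past programs approved
    # Who is being studied
    biomarker_selected: bool
    # Trial scale
    enrollment: int
    n_arms: int
    # Trial design
    randomized: bool
    blinding_level: int           # assumed encoding: 0 = open label, higher = more masking
    has_comparator: bool
    n_primary_outcomes: int
    n_secondary_outcomes: int

    def to_vector(self):
        """Flatten to a list of floats, in field order, for a statistical model."""
        return [float(v) for v in astuple(self)]


# Hypothetical Phase 2 trial.
facts = TrialFacts(
    cleared_earlier_phase=True,
    sponsor_success_rate=0.25,
    biomarker_selected=False,
    enrollment=120,
    n_arms=2,
    randomized=True,
    blinding_level=2,
    has_comparator=True,
    n_primary_outcomes=1,
    n_secondary_outcomes=6,
)
vector = facts.to_vector()
print(len(vector))  # 10
```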

One result the model surfaced is worth flagging honestly: more rigorous designs (randomized, blinded, with a comparator) are associated with lower approval in the historical data. This is a correlation, not cause and effect. It most likely reflects that some drugs reach approval through smaller, simpler early studies, while the large confirmatory trials are exactly where many drugs fail.

To avoid double-counting, the model uses the base rate as a fixed starting point and only learns the adjustments on top of it. Effects already captured in the base-rate table (such as biomarker-selected or orphan rows from the BIO/QLS and Thomas et al. cohorts) are not re-applied.
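A common way to hold a base rate fixed while learning adjustments on top of it is a logistic regression with an offset: the base rate enters on the log-odds scale with a coefficient pinned at 1, and only the feature weights are fitted. The sketch below assumes that formulation, which the text does not confirm. The single feature, the toy data, and the plain gradient-descent fit are all illustrative.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def adjusted_probability(base_rate, x, weights):
    """Base rate as a fixed log-odds offset, plus learned adjustments."""
    z = logit(base_rate) + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def fit_adjustments(rows, n_features, lr=0.5, epochs=2000):
    """Fit only the adjustment weights by gradient descent on log-loss.
    rows: iterable of (base_rate, feature_vector, approved) with approved in {0, 1}.
    The offset logit(base_rate) is never updated."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for base_rate, x, approved in rows:
            error = adjusted_probability(base_rate, x, weights) - approved
            for j, xj in enumerate(x):
                grad[j] += error * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Synthetic toy data, not real programs: ten drugs sharing a 13% base rate.
# Five have the (hypothetical) feature switched on, and three of those were approved.
BASE = 0.13
rows = [(BASE, [1.0], 1)] * 3 + [(BASE, [1.0], 0)] * 2 + [(BASE, [0.0], 0)] * 5
weights = fit_adjustments(rows, n_features=1)

without_feature = adjusted_probability(BASE, [0.0], weights)
with_feature = adjusted_probability(BASE, [1.0], weights)
print(f"{without_feature:.2f}")  # 0.13, the base rate is returned unchanged
print(f"{with_feature:.2f}")     # 0.60, pulled toward the observed 3-in-5 rate
```

When every feature is zero the score collapses back to the base rate, which is the sense in which the adjustments sit on top of it rather than replace it.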

How we know it works

MODEL v1 (LOGISTIC, JUNE 2026) - VALIDATED OUT-OF-SAMPLE

The model is not just trained, it is tested. We tested it against the ~4,500 resolved drug programs, always scoring drugs it was never allowed to study while learning and using only facts known before each trial began. Among drugs at the same stage, it ranks an eventual approval above an eventual failure 61 to 69 percent of the time (a within-stage AUC of 0.61 to 0.69; 0.50 would be a coin flip), and its percentages are well sized: when it says 30 percent, about 30 percent of those drugs are approved.
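The "well sized" claim is a statement about calibration, and it can be checked by grouping predictions into bins and comparing the average prediction in each bin with the share of drugs actually approved. The sketch below is one simple way to build such a table; the bin width, function name, and toy data are assumptions, not the procedure behind the published charts.

```python
def calibration_table(predictions, outcomes, n_bins=10):
    """Group predictions into equal-width bins and compare the mean predicted
    probability with the observed approval rate in each non-empty bin.
    Returns rows of (bin_low, bin_high, count, mean_predicted, observed_rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((p, y))

    table = []
    for idx, rows in enumerate(bins):
        if not rows:
            continue
        mean_predicted = sum(p for p, _ in rows) / len(rows)
        observed_rate = sum(y for _, y in rows) / len(rows)
        table.append((idx / n_bins, (idx + 1) / n_bins, len(rows),
                      mean_predicted, observed_rate))
    return table


# Toy data: ten drugs all scored at 30%, three of which were approved.
predictions = [0.30] * 10
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

table = calibration_table(predictions, outcomes)
for low, high, count, mean_predicted, observed_rate in table:
    print(f"[{low:.1f}, {high:.1f}) n={count} "
          f"predicted={mean_predicted:.2f} observed={observed_rate:.2f}")
```

A well-calibrated model shows predicted and observed values close together in every bin that has enough drugs to be meaningful.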

For context, the base rate alone - knowing only a drug’s disease area and stage - scores about 0.50 within a stage, a coin flip. So essentially all of the power to tell same-stage drugs apart comes from the ten learned facts. The full breakdown, including the calibration charts and the honest limits, is on the validation page.
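The ranking score can be computed directly from its definition: within one stage, take every (approved, failed) pair and count how often the approved drug received the higher score, with ties counting as half. The sketch below does this per stage. The function names and toy records are hypothetical, and how the per-stage figures are pooled into a headline number is not covered.

```python
from collections import defaultdict


def auc(scores, outcomes):
    """Share of (approved, failed) pairs where the approved drug scored higher.
    Ties count as half. Returns None if either class is absent."""
    approved = [s for s, y in zip(scores, outcomes) if y == 1]
    failed = [s for s, y in zip(scores, outcomes) if y == 0]
    if not approved or not failed:
        return None
    wins = sum(1.0 if a > f else 0.5 if a == f else 0.0
               for a in approved for f in failed)
    return wins / (len(approved) * len(failed))


def within_stage_auc(records):
    """records: iterable of (stage, score, approved). Returns {stage: AUC}."""
    by_stage = defaultdict(list)
    for stage, score, approved in records:
        by_stage[stage].append((score, approved))
    return {
        stage: auc([s for s, _ in rows], [y for _, y in rows])
        for stage, rows in by_stage.items()
    }


# Toy records: the model spreads its Phase 2 scores, while a base-rate-only
# scorer gives every drug in the stage the same number.
model_scores = [
    ("Phase 2", 0.20, 1), ("Phase 2", 0.12, 1),
    ("Phase 2", 0.10, 0), ("Phase 2", 0.15, 0),
]
base_rate_only = [(stage, 0.13, approved) for stage, _, approved in model_scores]

print(within_stage_auc(model_scores))    # {'Phase 2': 0.75}
print(within_stage_auc(base_rate_only))  # {'Phase 2': 0.5}
```

The second result is the limiting case behind the coin-flip figure: when every drug in a stage carries the same score, every pair is a tie and the AUC is exactly 0.5.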

By comparison, the strongest published academic models reach about 0.78 to 0.81 - but they train on expensive, private industry databases costing tens of thousands of dollars a year. We reach 0.61 to 0.69 on entirely free, public data. The gap is mostly data access, not method.

Two honest notes. Phase 3 is the weakest stage (0.61) - there are fewer finished Phase 3 programs to learn from, and judging whether a long, slow program truly failed is genuinely murky. And the model learns from the past to judge the present: if the way trials are run keeps shifting, its lessons may fit today’s drugs a little less well over time.

Known Limitations

  • This model uses publicly available pre-trial data only. It cannot predict outcomes that require the trial data itself - full Phase 2 data packages, interim analyses, undisclosed biomarker subgroups, or key opinion leader (KOL) assessments from investigator meetings. A large fraction of Phase 3 failures are fundamentally unpredictable from external pre-trial information: a clean Phase 2 signal can fail to replicate at Phase 3 scale for reasons no public model could anticipate. The practical accuracy ceiling for this class of model sits around 0.7 (within-stage AUC), which our out-of-sample test is consistent with. Use these scores for systematic portfolio screening, not individual trial calls.
  • Good on average is not right every time. The weights are now fitted to historical outcomes rather than hand-set, but a within-stage accuracy around 0.65 means the model will still be visibly wrong about some individual, well-known drugs. Treat each number as a careful estimate, not a verdict.
  • Diagnostic programs are not scored. Companion diagnostics, genetic tests, and screening assays are excluded - the Wong et al. framework applies to drug approval, not device/diagnostic clearance.
  • Combination regimens are harder to model. A drug that only exists in a combination arm may be scored on the combination’s phase, which conflates the individual agent’s contribution.
  • Approval probability ≠ commercial success. A drug can get approved and be a commercial failure. Regulatory probability is what this model estimates.
  • Data freshness varies. Pipeline status is updated periodically from ClinicalTrials.gov and FDA. Always verify against primary sources before acting on any data point.
Not financial advice. BioCosm is an intelligence tool that organizes public data. Probability scores reflect structured estimates of clinical trial success, not investment recommendations. Always do your own due diligence and consult a financial advisor.