Does the model actually work?
It is easy to put a confident-looking number next to a drug. It is much harder to show the number is any good. So here is the honest test.
We took 4,519 real drug programs whose fate is already settled - each one was eventually approved by the FDA or it failed. The programs come from the public trial registry at ClinicalTrials.gov (via the AACT database), and each one’s outcome is taken from the FDA’s own official approval records (Drugs@FDA) - so the answer key is the regulator’s, not ours. We then asked the model to score each drug as if it did not know the answer, and checked its guesses against what really happened. The goal of a test like this is not to make the model look good. It is to find out where it is wrong before anyone else does.
A reminder on what each score means: it is the cumulative probability that a drug is eventually approved, from its current phase - so it rises (and is re-dated) as a drug advances, and only the final approve-or-fail outcome is graded.
Take two drugs at the same stage, one that went on to approval and one that failed. The model gives the approved one the higher score about 61 to 69 percent of the time, depending on the stage (50 percent would be pure luck). Modest, but real.
Every score was made on a drug the model had never studied, using only facts known before that trial began. No peeking at the answer, checked automatically: 0 violations.
How the test works
A fair test of a prediction model comes down to three rules, and we held to all three.
1. Grade against reality, not the trial result. The outcome we check is whether the drug actually reached its first FDA approval, found by matching every drug to the official Drugs@FDA records. We do not count “the trial finished” as a win. Only real approval counts.
2. Never let the model see the answer. We split the drugs into groups, trained the model on some, and tested it on the others - so every score is for a drug the model never learned from. We also group by drug, so the same drug can never sit on both sides. This is the line between truly predicting and simply memorizing.
3. No time travel. Every fact fed to the model is reconstructed as it looked before that trial began. A trial’s own result is never used to score itself. The pipeline checks this automatically and refuses to run if it finds even one violation. It found 0.
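For readers who think in code, rules 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the field names (`drug_id`, `trial_start`, `facts_as_of`), the hash-based fold assignment, and the five folds are all assumptions made for the example.

```python
import hashlib
from datetime import date


def fold_for_drug(drug_id, n_folds=5):
    """Map a drug to one fold, so all of its trials land on the same side."""
    digest = hashlib.sha256(drug_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def split_by_drug(rows, test_fold, n_folds=5):
    """Return (train, test); a drug's rows never appear in both lists."""
    train, test = [], []
    for row in rows:
        if fold_for_drug(row["drug_id"], n_folds) == test_fold:
            test.append(row)
        else:
            train.append(row)
    return train, test


def check_no_time_travel(rows):
    """Raise if any input fact is dated on or after its trial's start date."""
    violations = [r for r in rows if r["facts_as_of"] >= r["trial_start"]]
    if violations:
        raise ValueError(f"{len(violations)} time-travel violation(s) found")
    return 0


# Hypothetical rows: two trials of drug-A, one each of drug-B and drug-C.
rows = [
    {"drug_id": "drug-A", "trial_start": date(2015, 3, 1), "facts_as_of": date(2015, 2, 28)},
    {"drug_id": "drug-A", "trial_start": date(2017, 6, 1), "facts_as_of": date(2017, 5, 31)},
    {"drug_id": "drug-B", "trial_start": date(2016, 1, 15), "facts_as_of": date(2016, 1, 14)},
    {"drug_id": "drug-C", "trial_start": date(2018, 9, 10), "facts_as_of": date(2018, 9, 9)},
]

check_no_time_travel(rows)  # passes silently when there are no violations
train, test = split_by_drug(rows, test_fold=fold_for_drug("drug-A"))
```

Because the fold is derived from the drug identifier rather than from the individual trial, both drug-A trials always end up in the same group, which is the point of rule 2.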
Where the 4,519 drugs come from
We start from industry-run trials of genuinely new, single drugs in the public ClinicalTrials.gov database, then match each drug to its real FDA outcome. The aim is to mirror the kind of drug a company actually advances, not every one-off academic study. Source: AACT industry novel-drug programs.
Result 1: can it tell winners from losers?
We measure this with a standard score called AUC. Picture handing the model one drug that was approved and one that failed, and asking which it rates higher. Do that for every possible pair and count how often it gets the order right. A perfect model is right every time (1.00). Pure luck is right half the time (0.50). Higher is better.
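The pair-counting picture above translates almost directly into code. The sketch below is an illustration only, with made-up scores; counting a tied pair as half a win is the usual convention and is an assumption here, as is the function name.

```python
def pairwise_auc(scores, approved):
    """Share of (approved, failed) pairs where the approved drug scores higher.

    Tied pairs count as half a win.
    """
    winners = [s for s, a in zip(scores, approved) if a]
    losers = [s for s, a in zip(scores, approved) if not a]
    if not winners or not losers:
        raise ValueError("need at least one approved and one failed drug")

    correct = 0.0
    for w in winners:
        for l in losers:
            if w > l:
                correct += 1.0
            elif w == l:
                correct += 0.5
    return correct / (len(winners) * len(losers))


# Made-up example: two approved drugs, three failed ones.
scores = [0.40, 0.25, 0.10, 0.30, 0.05]
approved = [True, True, False, False, False]
auc = pairwise_auc(scores, approved)  # 5 of the 6 pairs are ordered correctly
```

The double loop is fine for a few thousand drugs; larger sets are usually handled with a rank-based formula that gives the same answer.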
The key is that these scores are for drugs the model never trained on - so it is real prediction, not recall. Stage by stage:
| Stage | Drugs | Approved | Accuracy (AUC) | In plain terms |
|---|---|---|---|---|
| Phase 1 | 2,040 | 236 | 0.69 | Clearly better than a coin flip. |
| Phase 2 | 1,461 | 229 | 0.68 | Clearly better than a coin flip. |
| Phase 3 | 1,018 | 292 | 0.61 | Helpful, but the hardest stage to call. |
Two things worth knowing. First, we report accuracy within each stage on purpose. If you mix all stages together the number looks higher, but that is partly a cheat: just knowing a drug is in Phase 3 rather than Phase 1 already tells you a lot. The within-stage number is the honest one. Second, the starting point alone - knowing only a drug’s disease area and stage - scores about 0.50 within a stage, a coin flip. So essentially all of the skill above comes from the ten extra facts the model weighs.
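To see how much mixing stages can flatter a model, consider a deliberately naive scorer that knows nothing but the stage. The sketch below uses only the drug and approval counts from the table above; the scorer itself is hypothetical and is not part of the model. It gives every drug its stage's approval rate as a score, then computes the pooled AUC by pair counting.

```python
# (stage, drugs, approved) taken from the table above.
STAGES = [
    ("Phase 1", 2040, 236),
    ("Phase 2", 1461, 229),
    ("Phase 3", 1018, 292),
]


def pooled_auc_from_stage_alone(stages):
    """Pooled AUC of a scorer that gives each drug its stage's approval rate."""
    groups = [(ok / total, ok, total - ok) for _, total, ok in stages]
    correct = 0.0
    for rate_w, n_approved, _ in groups:
        for rate_l, _, n_failed in groups:
            pairs = n_approved * n_failed
            if rate_w > rate_l:
                correct += pairs        # approved drug sits in the later stage
            elif rate_w == rate_l:
                correct += 0.5 * pairs  # same stage, same score: a tie
    all_approved = sum(g[1] for g in groups)
    all_failed = sum(g[2] for g in groups)
    return correct / (all_approved * all_failed)


naive_pooled = pooled_auc_from_stage_alone(STAGES)
```

On these counts the naive scorer lands at roughly 0.62 when all stages are pooled, even though inside any single stage it gives every drug the same score and so sits at exactly 0.50. That gap is the part of a pooled number that comes from the stage label rather than from any real insight.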
The best academic models of drug approval reach roughly 0.78 to 0.81 accuracy (for example Lo and colleagues at MIT). But they are trained on costly, private industry databases - the kind that run tens of thousands of dollars a year and are not available to the public. BioCosm reaches 0.61 to 0.69 using only free, public data. The remaining gap is mostly about data access, not method: cleaner commercial records would close much of it. We would rather be honest about a 0.65 we can show our work for than quote a higher number we cannot.
Result 2: are the percentages honest?
Telling winners from losers is one thing. We also want the actual numbers to mean what they say: when the model says “30 percent,” do about 30 percent of those drugs really get approved? That is called calibration, and it is what these charts show. Each dot is a group of drugs. Its left-right position is what the model predicted; its up-down position is what actually happened. The closer the dots sit to the dashed line, the more honest the numbers.
One thing to expect when you read these charts: the dots toward the right side (high predicted chance) rest on far fewer drugs, because few drugs ever score that high. With less data behind them, those points naturally jump around more - that wobble is small-sample noise, not the model breaking. The size of each dot shows how many drugs it stands on, so the big dots on the left are the ones to trust most.
The single summary number is the calibration error: the average gap between the predicted approval rate and the actual one. Across every stage it lands at 2 percentage points or less, which is tight. In plain terms: the model’s percentages are not just good at ranking, they are about the right size.
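One common way to compute a calibration error of this kind is sketched below. It is an illustration under stated assumptions, not necessarily the exact recipe used here: the ten equal-width groups and the weighting of each group by its number of drugs are choices made for the example, and the data is made up.

```python
def calibration_error(predicted, approved, n_bins=10):
    """Average gap between predicted and observed approval rates.

    Drugs are grouped into equal-width bins by predicted probability, and
    each bin's gap is weighted by the share of drugs it contains.
    """
    bins = [[] for _ in range(n_bins)]
    for p, a in zip(predicted, approved):
        index = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[index].append((p, 1.0 if a else 0.0))

    total = len(predicted)
    error = 0.0
    for group in bins:
        if not group:
            continue  # an empty bin contributes nothing
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(a for _, a in group) / len(group)
        error += (len(group) / total) * abs(mean_predicted - observed_rate)
    return error


# Made-up example: ten drugs scored 0.10 (one approved), ten scored 0.30 (three approved).
predicted = [0.10] * 10 + [0.30] * 10
approved = [True] + [False] * 9 + [True] * 3 + [False] * 7
well_calibrated = calibration_error(predicted, approved)

# Ten drugs scored 0.20, but four of them were approved: a 20-point gap.
overconfident_low = calibration_error([0.20] * 10, [True] * 4 + [False] * 6)
```

Weighting by group size matches the advice above about trusting the big dots most: a sparse group at the high end cannot dominate the summary.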
The honest limits
- Phase 3 is the weakest (0.61). It is the hardest stage to call: there are fewer finished Phase 3 programs to learn from, and judging whether a long, slow program truly failed is genuinely murky. We are still working to sharpen it.
- It learns from the past to judge the present. The model was trained on drugs that have already finished. If the way trials are run keeps shifting, its lessons may fit today’s drugs a little less well over time.
- Good on average is not right every time. A within-stage accuracy around 0.65 is useful, not magic. The model will be visibly wrong about some individual, well-known drugs. Treat each number as a careful estimate, not a verdict.
- This is version 1. An honest, working baseline, not a finished product. Better data quality, richer inputs, and a sharper Phase 3 are all on the roadmap, and we expect these numbers to improve.