
August Scores 100% on the USMLE

August now scores a perfect 100% on the US Medical Licensing Examination, the exam physicians must pass to practice medicine in the United States, and August has saturated it. Two years ago, we scored 94.8%. Today, we're sharing results from that benchmark and two others.

The USMLE

The United States Medical Licensing Examination is a three-step, high-stakes test required for medical licensure in the US. It covers everything physicians need to know: basic science, clinical reasoning, medical management, bioethics. The questions are rigorously standardized, which makes the USMLE an ideal benchmark for medical AI.

Step 1 is 280 multiple-choice questions testing foundational medical science. Step 2 CK adds 318 questions focused on clinical decision-making. Step 3 spans two days with 412 questions plus 13 simulated patient cases; it tests whether you can practice medicine independently.

The passing threshold is around 60%. Human physicians train for years to clear it. August's results effectively saturate this benchmark.
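
For orientation, here is a minimal sketch of how a raw benchmark score relates to that bar. The step structure and the roughly 60% figure come from the description above; the counts and threshold are approximations (the real exam uses scaled scores with cutoffs set per step), and nothing here reflects official scoring.

```python
# Approximate USMLE structure as described above; real counts and cutoffs vary.
USMLE_STEPS = {
    "Step 1": {"questions": 280},
    "Step 2 CK": {"questions": 318},
    "Step 3": {"questions": 412, "case_simulations": 13},
}

PASS_THRESHOLD = 0.60  # rough fraction-correct bar; the real exam uses scaled scores

def clears_bar(correct: int, total: int) -> bool:
    """Whether a raw fraction of correct answers clears the approximate passing bar."""
    return correct / total >= PASS_THRESHOLD

# A perfect score on Step 2 CK's 318 questions trivially clears the bar.
print(clears_bar(318, USMLE_STEPS["Step 2 CK"]["questions"]))  # True
```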

August versus other leading models on the USMLE benchmark
August achieves a perfect score on the USMLE while other state-of-the-art models trail behind.

MedQA

MedQA is the standard benchmark for medical AI. It was introduced by researchers at MIT in 2020 and sources questions from professional medical licensing exams across three languages: English, Simplified Chinese, and Traditional Chinese.

The English test set has 1,273 USMLE-style questions, each with four or five answer choices. These aren't simple recall questions; they require multi-step clinical reasoning. You read a patient vignette, interpret symptoms and test findings, reason about the diagnosis, then pick the right next step. It's closer to how physicians actually think.
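
For readers who want to run a comparable evaluation themselves, here is a minimal sketch of scoring a model on the MedQA English test set. The JSON Lines layout and the question, options, and answer_idx field names mirror the format of the original MedQA release but may differ in your copy, the file path is made up, and ask_model is a hypothetical stand-in for the model under test; none of this is August's internal evaluation harness.

```python
import json

def load_medqa(path: str) -> list[dict]:
    """Load MedQA-style questions from a JSON Lines file, one question per line,
    e.g. {"question": "...", "options": {"A": "...", ...}, "answer_idx": "C"}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def ask_model(question: str, options: dict[str, str]) -> str:
    """Placeholder for the model under test; should return the chosen option letter."""
    raise NotImplementedError("Plug in your model call here.")

def accuracy(examples: list[dict]) -> float:
    """Fraction of questions where the model's letter matches the answer key."""
    correct = sum(
        1 for ex in examples
        if ask_model(ex["question"], ex["options"]) == ex["answer_idx"]
    )
    return correct / len(examples)

if __name__ == "__main__":
    test_set = load_medqa("medqa_us_test.jsonl")  # the 1,273-question English test set
    print(f"MedQA accuracy: {accuracy(test_set):.1%}")
```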

August versus other models on the MedQA benchmark
August leads the MedQA benchmark, reflecting strong multi-step clinical reasoning.

MMLU medical subsets

MMLU (Massive Multitask Language Understanding) is the standard benchmark for broad general knowledge in AI models. It was introduced by UC Berkeley researchers in 2021 and covers 57 subjects with nearly 16,000 questions.

For medical evaluation, six subsets matter most: Clinical Knowledge (265 questions), Professional Medicine (272), Medical Genetics (100), Anatomy (135), College Medicine (173), and College Biology (144). Together, that's about 1,100 questions testing medical and biological knowledge at various levels.
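
As a small illustration of how a medical slice is typically carved out of MMLU, the sketch below hard-codes the six subsets and their question counts from the list above and combines per-subset results into one micro-averaged accuracy. The lowercase subject identifiers follow the naming used in common distributions of the benchmark, and the micro-averaging is one reasonable choice, not necessarily the aggregation used for the chart below.

```python
# Test-set question counts for the six MMLU medical subsets listed above.
MEDICAL_SUBSETS = {
    "clinical_knowledge": 265,
    "professional_medicine": 272,
    "medical_genetics": 100,
    "anatomy": 135,
    "college_medicine": 173,
    "college_biology": 144,
}

def is_medical(subject: str) -> bool:
    """True if an MMLU subject belongs to the medical slice."""
    return subject in MEDICAL_SUBSETS

def micro_average(correct_per_subset: dict[str, int]) -> float:
    """Pool all six subsets into one overall accuracy.

    correct_per_subset maps subject name -> number of questions answered correctly.
    """
    total = sum(MEDICAL_SUBSETS.values())  # 1,089 questions, i.e. "about 1,100"
    correct = sum(correct_per_subset.get(s, 0) for s in MEDICAL_SUBSETS)
    return correct / total

if __name__ == "__main__":
    print(sum(MEDICAL_SUBSETS.values()))  # 1089
```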

One caveat: independent analysis found that roughly 6.5% of MMLU questions are flawed, with wrong answer keys, ambiguous wording, or multiple valid choices. The Virology subset is particularly affected. This means the theoretical ceiling is below 100%.

August's scores across MMLU medical subsets compared to other models
August maintains a strong lead across every medical subset within MMLU.

What these benchmarks measure and what they don't

High scores on medical knowledge benchmarks are necessary but not sufficient. The USMLE, MedQA, and MMLU test whether an AI can reason through standardized exam questions. They don't test the messiness of real conversations users have about their health.

Real patient records are noisy: irrelevant details, incomplete information, contradictory notes. Real medicine involves timing: when to treat, when to wait, when to escalate. It involves communication, uncertainty, and judgment calls that don't fit neatly into multiple choice.

We've been working on deeper benchmarks that get closer to how users actually interact with August. In December 2024, we published "A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI", which describes our current approach to ensuring each interaction with August is safe and accurate. We've since extended this framework to cover more aspects of the conversation and will publish more on August's safety, accuracy, medical record handling, and efficacy in 2026.
