Benchmark Performance

Clinical Validation Through Rigorous Benchmarking

August is evaluated against the same standardized tests used to train and license doctors. These aren't arbitrary metrics; they're the assessments that determine whether someone is qualified to practice medicine.

When you ask August about your health, you deserve to know the answers are backed by the same rigor that qualifies your doctor.

01 Medical Knowledge

August delivers clinical responses validated against the same rigorous standards used in physician training and medical licensure, ensuring professional-grade accuracy in healthcare information delivery.

August achieves 97% accuracy on MedQA, a set of 12,000+ authentic medical licensing questions, outperforming the general-purpose models it was compared against. We attribute this to its specialized clinical architecture.

Medical Knowledge benchmark comparing August vs GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-4o on MedQA and USMLE

| Benchmark | August | GPT-5 | Claude 4.5 Sonnet | Gemini 2.5 Pro | GPT-4o |
|---|---|---|---|---|---|
| MedQA | 97% | 95.8% | 94.7% | 93.2% | 88.1% |
| USMLE | 100% | 97% | 97.2% | 90.1% | 92.3% |

02 USMLE Performance

The United States Medical Licensing Examination (USMLE) is the test every doctor must pass before practicing medicine in the United States. It's the gold standard for medical knowledge.

August achieves 100% accuracy on the USMLE, well above the roughly 60% of questions an examinee typically needs to answer correctly to pass. The exam covers the diagnosis, treatment, and medical decision-making skills that separate qualified doctors from everyone else.

USMLE benchmark showing August at 100%, compared to GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, GPT-4o, and August (2023)

| Model | USMLE accuracy |
|---|---|
| August | 100% |
| Claude 4.5 Sonnet | 97.2% |
| GPT-5 | 97.0% |
| GPT-4o | 92.3% |
| Gemini 2.5 Pro | 90.1% |

03 MedQA Benchmark

A collection of 12,000+ real medical board exam questions from the US, China, and Taiwan, covering everything from rare diseases to common conditions, diagnostic reasoning to treatment plans.

August outperforms GPT-5 by 1.5 percentage points (pp) and Gemini 2.5 Pro by 4.2 pp on MedQA. August doesn't guess: it has been tested on thousands of real-world medical scenarios and validated against the same questions doctors answer to get licensed.

MedQA benchmark showing August leading with 97.36% accuracy

| Model | MedQA accuracy |
|---|---|
| August | 97.4% |
| GPT-5 | 95.8% |
| Claude 4.5 Sonnet | 94.7% |
| Gemini 2.5 Pro | 93.2% |
| GPT-4o | 88.1% |

04 MMLU Clinical Benchmarks

A comprehensive test across six medical specialties: anatomy, genetics, clinical knowledge, professional medicine, college biology, and college medicine.

August scores about 94% or higher in every clinical category, because your health questions don't fit into neat boxes. Whether your question is about your thyroid, your genes, or your child's development, you get the same level of reliability.
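One way to read the headline figure is as a per-category floor rather than an overall average. The sketch below summarizes August's six category scores as reported in this section's breakdown. The `summarize` helper is illustrative only, and the unweighted mean is an assumption: the categories may contain different numbers of questions, which this page does not state.

```python
# August's per-category MMLU Clinical scores, copied from this section's breakdown.
august_scores = {
    "College Medicine": 95.1,
    "Clinical Knowledge": 96.4,
    "Anatomy": 93.6,
    "Professional Medicine": 98.1,
    "College Biology": 99.3,
    "Medical Genetics": 100.0,
}


def summarize(scores):
    """Return (weakest category, its score, unweighted mean) for a score dict.

    Hypothetical helper: the unweighted mean treats every category equally,
    which is an assumption, not something the benchmark page specifies.
    """
    weakest = min(scores, key=scores.get)
    mean = sum(scores.values()) / len(scores)
    return weakest, scores[weakest], mean


weakest, floor, mean = summarize(august_scores)
print(f"floor: {floor:.1f}% ({weakest}), unweighted mean: {mean:.1f}%")
```

Under these assumptions the lowest category score is Anatomy at 93.6%, which rounds to the 94% quoted above.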

MMLU Clinical benchmark breakdown across Anatomy, Clinical Knowledge, College Biology, College Medicine, Medical Genetics, and Professional Medicine

| Category | August | GPT-5 | Claude 4.5 Sonnet | Gemini 2.5 Pro | GPT-4o |
|---|---|---|---|---|---|
| College Medicine | 95.1% | 91.9% | 92.5% | 89.6% | 87.8% |
| Clinical Knowledge | 96.4% | 95.1% | 94.3% | 92.8% | 90.4% |
| Anatomy | 93.6% | 92.6% | 92.1% | 90.4% | 88.9% |
| Professional Medicine | 98.1% | 97.8% | 96.9% | 95.5% | 93.7% |
| College Biology | 99.3% | 99.3% | 98.8% | 97.6% | 96.2% |
| Medical Genetics | 100% | 100% | 99.4% | 98.1% | 96.8% |

05 Conversational Diagnostics

Most benchmarks test whether an AI can select the right answer from a list, not whether it can converse with a patient, ask the right questions, and reach a diagnosis the way a doctor does. We developed an in-house, peer-reviewed methodology to evaluate that ability: a framework that simulates real clinical conversations across 400 clinical vignettes spanning 14 medical specialties.

August achieves 87% diagnostic accuracy in multi-turn clinical conversations, 21 pp above GPT-5, and 97% triage accuracy, which measures whether patients are routed to the right level of care. This isn't a multiple-choice test; it is diagnosis through conversation.

Conversational Diagnostics benchmark showing August at 87% diagnostic accuracy and 97% triage accuracy across 400 clinical vignettes

| Metric | August | GPT-5 | Claude 4.5 Sonnet | Gemini 2.5 Pro | GPT-4o |
|---|---|---|---|---|---|
| Diagnostic Accuracy | 87% | 66% | 75% | 69% | 55.3% |
| Triage Accuracy | 97% | 88.3% | 92.5% | 88.8% | 91.3% |

Unlike static multiple-choice benchmarks, this evaluation uses multi-turn conversations in which the AI must gather information by asking questions, just as in a real clinical encounter. August reaches the correct diagnosis in 47% fewer questions than competitors (16 vs 29 on average), demonstrating both accuracy and efficiency. The methodology has been peer-reviewed and is published as arXiv:2412.12538.
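To make the three reported quantities concrete (diagnostic accuracy, triage accuracy, and average questions asked), here is a minimal scoring sketch. It is not the published methodology: the `VignetteResult` fields, the `score` function, the exact-match comparison, and the two toy records are all hypothetical, and a real evaluation would likely need clinical judgment to decide whether two diagnoses match.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class VignetteResult:
    """Outcome of one simulated conversation (all field names are hypothetical)."""
    expected_diagnosis: str
    final_diagnosis: str
    expected_triage: str
    final_triage: str
    questions_asked: int


def _same(a: str, b: str) -> bool:
    # Assumption: case-insensitive exact match stands in for clinical grading.
    return a.strip().lower() == b.strip().lower()


def score(results):
    """Aggregate per-vignette outcomes into the three headline metrics."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    return {
        "diagnostic_accuracy": sum(
            _same(r.final_diagnosis, r.expected_diagnosis) for r in results
        ) / n,
        "triage_accuracy": sum(
            _same(r.final_triage, r.expected_triage) for r in results
        ) / n,
        "mean_questions": mean(r.questions_asked for r in results),
    }


# Toy records, invented purely for illustration (not benchmark data).
toy_results = [
    VignetteResult("migraine", "Migraine", "routine", "routine", 10),
    VignetteResult("appendicitis", "gastroenteritis", "emergency", "emergency", 18),
]
summary = score(toy_results)
print(summary)
```

Keeping the per-vignette record separate from the aggregation makes it easy to report the same metrics per specialty by filtering the list before calling `score`.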

06 Document Processing

This benchmark covers processing and interpreting clinical documentation, including laboratory reports, prescriptions, and discharge summaries, in both printed and handwritten form. It is evaluated using our in-house benchmark methodology.

August achieves 82% accuracy on handwritten prescriptions versus 49% for competitors, and 99% accuracy on laboratory reports. August translates medical jargon into plain language and helps you understand what your doctor is recommending.

Document Processing benchmark comparing Lab Report and Handwritten Prescription accuracy across models

| Document type | August | GPT-5 | Claude 4.5 Sonnet | Gemini 2.5 Pro | GPT-4o |
|---|---|---|---|---|---|
| Lab Reports | 99.4% | 93% | 99% | 98% | 72% |
| Handwritten Rx | 82% | 58% | 45% | 60.8% | 38% |

07 Safety & Escalation

Safety By Design

Safety here means correctly identifying when symptoms need immediate medical attention and when they can wait for a regular appointment. We measure it with our proprietary in-house emergency escalation benchmark.

On this benchmark, August achieves 100% recall and 100% precision in emergency identification, while other LLMs average 28% precision. In other words, roughly 7 out of 10 times they tell you it's an emergency, it's not. August gets it right 10 out of 10 times.

Emergency Escalation Precision benchmark showing August at 100% compared to other models averaging 28%

| Model | Precision |
|---|---|
| August | 100% |
| Claude 4 Opus | 30% |
| Claude 4 Sonnet | 30% |
| GPT-4.1 | 28% |
| o3 | 28% |

Safety: 100% recall. Every true emergency flagged. None missed.

Zero false positives: 100% precision. Eliminates inappropriate emergency escalations (competitors: ~28%).

Test Results (N = 138)

| Outcome | Count | Meaning |
|---|---|---|
| True Positives | 41 | Emergencies correctly identified |
| False Positives | 0 | False alarms |
| False Negatives | 0 | Missed emergencies |
| True Negatives | 97 | Non-emergencies correctly handled |
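The recall and precision figures follow directly from these four counts using the standard definitions. The sketch below recomputes them. The function names are illustrative, and the last lines show how a 28% precision translates into the share of emergency flags that are false alarms.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard confusion-matrix metrics for a binary 'emergency' classifier."""
    total = tp + fp + fn + tn
    return {
        "total": total,
        # Of everything flagged as an emergency, how much really was one?
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        # Of all true emergencies, how many were flagged?
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Of all non-emergencies, how many were correctly left unflagged?
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


def false_alarm_share(precision: float) -> float:
    """Fraction of emergency flags that are false alarms (1 - precision)."""
    return 1.0 - precision


# Counts taken from the test results reported above.
metrics = classification_metrics(tp=41, fp=0, fn=0, tn=97)
print(metrics)

# At 28% precision, about 72% of emergency flags are false alarms,
# which is the "roughly 7 out of 10" figure.
print(f"{false_alarm_share(0.28):.0%}")
```

Note that 41 + 0 + 0 + 97 = 138, matching the stated test set size.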

Talk to August

Need medical answers right now?

For instant 24/7 medical guidance, reach out to August.

Your health journey starts with a single question

Download August today. No appointments. Just answers you can trust.
