Benchmark Performance

Clinical Validation Through Rigorous Benchmarking

August is evaluated against the same standardized tests used to train and license doctors. These aren't arbitrary metrics; they're the same assessments that determine whether someone is qualified to practice medicine.

When you ask August about your health, you deserve to know the answers are backed by the same rigor that qualifies your doctor.


01 Medical Knowledge

August delivers clinical responses validated against the same rigorous standards used in physician training and medical licensure, ensuring professional-grade accuracy in healthcare information delivery.

August achieves 97% accuracy on MedQA, a set of 12,000+ authentic medical licensing questions, outperforming general-purpose models through specialized clinical architecture.

Medical Knowledge benchmark comparing August vs GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-4o on MedQA and USMLE

Model               MedQA    USMLE
August              97%      100%
GPT-5               95.8%    97%
Claude 4.5 Sonnet   94.7%    97.2%
Gemini 2.5 Pro      93.2%    90.1%
GPT-4o              88.1%    92.3%

02 USMLE Performance

The United States Medical Licensing Examination is the test every doctor must pass before they can practice medicine in the United States. It's the gold standard for medical knowledge.

August achieves 100% accuracy on the USMLE, compared to 60% first-attempt pass rates for human medical graduates, demonstrating the same qualification on diagnosis, treatment, and medical decision-making that separates qualified doctors from everyone else.

USMLE benchmark showing August at 100%, compared to GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, GPT-4o, and August (2023)

Model               USMLE
August              100%
Claude 4.5 Sonnet   97.2%
GPT-5               97.0%
GPT-4o              92.3%
Gemini 2.5 Pro      90.1%

03 MedQA Benchmark

A collection of 12,000+ real medical board exam questions from the US, China, and Taiwan, covering everything from rare diseases to common conditions, diagnostic reasoning to treatment plans.

August outperforms GPT-5 by 1.5 pp and Gemini 2.5 Pro by 4.2 pp because August has been tested on thousands of real-world medical scenarios. August doesn't guess: it has been validated against the same questions doctors use to get licensed.

MedQA benchmark showing August leading with 97.36% accuracy

Model               MedQA
August              97.4%
GPT-5               95.8%
Claude 4.5 Sonnet   94.7%
Gemini 2.5 Pro      93.2%
GPT-4o              88.1%
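For readers who want to check the margins themselves, here is a minimal sketch that computes August's lead in percentage points from the MedQA chart values shown on this page. The helper name `margin_pp` is ours, for illustration only. Because the chart values are rounded to one decimal place, results may differ by about 0.1 pp from margins quoted in the text, which presumably use unrounded scores (an assumption on our part).

```python
# MedQA accuracy (percent) as displayed in the chart on this page.
medqa = {
    "August": 97.4,
    "GPT-5": 95.8,
    "Claude 4.5 Sonnet": 94.7,
    "Gemini 2.5 Pro": 93.2,
    "GPT-4o": 88.1,
}


def margin_pp(scores, model, baseline):
    """Return model's lead over baseline in percentage points (hypothetical helper)."""
    return round(scores[model] - scores[baseline], 1)


# August's lead over every other model in the chart.
margins = {
    name: margin_pp(medqa, "August", name)
    for name in medqa
    if name != "August"
}

for name, pp in margins.items():
    print(f"August vs {name}: +{pp} pp")
```

A percentage point is a plain difference between two percentages, so it should not be read as a relative improvement.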

04 MMLU Clinical Benchmarks

A comprehensive test across six medical subject areas: anatomy, medical genetics, clinical knowledge, professional medicine, college biology, and college medicine.

August maintains 94% accuracy across all clinical categories, because your health questions don't fit into neat boxes. Whether your question is about your thyroid, your genes, or your child's development, you get the same level of reliability.

MMLU Clinical benchmark breakdown across Anatomy, Clinical Knowledge, College Biology, College Medicine, Medical Genetics, and Professional Medicine

Category                August   GPT-5    Claude 4.5 Sonnet   Gemini 2.5 Pro   GPT-4o
College Medicine        95.1%    91.9%    92.5%               89.6%            87.8%
Clinical Knowledge      96.4%    95.1%    94.3%               92.8%            90.4%
Anatomy                 93.6%    92.6%    92.1%               90.4%            88.9%
Professional Medicine   98.1%    97.8%    96.9%               95.5%            93.7%

College Biology: 99.3%, 98.8%, 97.6%, 96.2% (four of five values shown)
Medical Genetics: 100%, 99.4%, 98.1%, 96.8% (four of five values shown)

05 Conversational Diagnostics

Most benchmarks test whether an AI can select the right answer from a list, rather than converse with a patient, ask the right questions, and reach a diagnosis the way a doctor does. We developed an in-house methodology to evaluate this across 400 clinical vignettes spanning 14 medical specialties, a peer-reviewed framework that simulates real clinical conversations.

August achieves 87% diagnostic accuracy in multi-turn clinical conversations, 21 pp above GPT-5, and 97% triage accuracy, ensuring patients are routed to the right level of care. Evaluated using our proprietary in-house methodology, this isn't a multiple-choice test: it is diagnosis through conversation.

Conversational Diagnostics benchmark showing August at 87% diagnostic accuracy and 97% triage accuracy across 400 clinical vignettes

Model               Diagnostic Accuracy   Triage Accuracy
August              87%                   97%
GPT-5               66%                   88.3%
Claude 4.5 Sonnet   75%                   92.5%
Gemini 2.5 Pro      69%                   88.8%
GPT-4o              55.3%                 91.3%

Unlike static multiple-choice benchmarks, this evaluation uses our proprietary in-house methodology featuring multi-turn conversations where the AI must gather information through questions, just like a real clinical encounter. August reaches the correct diagnosis in 47% fewer questions than competitors (16 vs 29 on average), demonstrating both accuracy and efficiency. This in-house methodology has been peer-reviewed and published on arXiv as arXiv:2412.12538.
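As a rough illustration of how these comparisons are calculated, the sketch below works through two figures from this section: the diagnostic accuracy margin over GPT-5 and the relative reduction in questions asked. The function name `relative_reduction` is ours, not part of the published methodology. It uses the rounded averages quoted above (16 and 29); with those rounded inputs the reduction comes out a little under the quoted 47%, which presumably reflects unrounded averages (an assumption on our part).

```python
def relative_reduction(baseline, value):
    """Fractional reduction of value relative to baseline (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


# Diagnostic accuracy (percent) from the table in this section.
august_accuracy = 87
gpt5_accuracy = 66
diagnostic_margin_pp = august_accuracy - gpt5_accuracy

# Rounded average question counts quoted in the text.
august_questions = 16
competitor_questions = 29
reduction = relative_reduction(competitor_questions, august_questions)

print(f"Diagnostic margin over GPT-5: +{diagnostic_margin_pp} pp")
print(f"Relative reduction in questions: {reduction:.1%}")
```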

06 Document Processing

This benchmark covers processing and interpreting clinical documentation, including laboratory reports, prescriptions, and discharge summaries, with capability demonstrated on both printed and handwritten medical documents. It is evaluated using our in-house benchmark methodology.

Using our in-house benchmark methodology, August achieves 82% accuracy on handwritten prescriptions versus 49% for competitors, and 99% accuracy on laboratory reports. August translates medical jargon into plain language and helps you understand what your doctor is recommending.

Document Processing benchmark comparing Lab Report and Handwritten Prescription accuracy across models

Model               Lab Reports   Handwritten Rx
August              99.4%         82%
GPT-5               93%           58%
Claude 4.5 Sonnet   99%           45%
Gemini 2.5 Pro      98%           60.8%
GPT-4o              72%           38%

07 Safety & Escalation

Safety By Design

This benchmark measures the ability to correctly identify when symptoms need immediate medical attention versus when they can wait for a regular appointment. It is measured using our proprietary in-house emergency escalation benchmark.

Using our proprietary in-house emergency escalation benchmark, August achieves 100% recall and 100% precision in emergency identification, while other LLMs average 28% precision, meaning that roughly 7 out of 10 times they tell you it's an emergency, it's not. August gets it right 10 out of 10 times.

Emergency Escalation Precision benchmark showing August at 100% compared to other models averaging 28%

Model             Emergency Escalation Precision
August            100%
Claude 4 Opus     30%
Claude 4 Sonnet   30%
GPT-4.1           28%

Safety: 100% Recall. Every true emergency flagged. None missed.

Zero False Positives: 100% Precision. Eliminates inappropriate emergency escalations (competitors: ~28%).

Test Results (N = 138)

True Positives: Emergencies correctly identified
False Positives: False alarms
False Negatives: Missed emergencies
True Negatives: Non-emergencies correctly handled
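To make the relationship between these four outcome types and the headline metrics concrete, here is a minimal sketch of the standard precision and recall formulas. The class name `ConfusionCounts` and every count below are hypothetical, chosen only for illustration; they are not the benchmark's actual per-cell results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for an emergency classifier (hypothetical container)."""
    tp: int  # emergencies correctly identified
    fp: int  # false alarms
    fn: int  # missed emergencies
    tn: int  # non-emergencies correctly handled

    def precision(self) -> float:
        """Share of emergency flags that were real emergencies."""
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        """Share of real emergencies that were flagged."""
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0


# Hypothetical counts for illustration only, not the benchmark's results.
no_errors = ConfusionCounts(tp=25, fp=0, fn=0, tn=75)
over_escalating = ConfusionCounts(tp=28, fp=72, fn=0, tn=100)

print(f"No errors: precision={no_errors.precision():.0%}, recall={no_errors.recall():.0%}")
print(f"Over-escalating: precision={over_escalating.precision():.0%}")
print(f"False-alarm share of flags: {1 - over_escalating.precision():.0%}")
```

With zero false positives and zero false negatives, both metrics equal 100%. At 28% precision, 72% of emergency flags are false alarms, which is where the "7 out of 10" figure comes from.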

Talk to August

Need medical answers right now?

For instant 24/7 medical guidance, reach out to August.


Your health journey starts with a single question

Download August today. No appointments. Just answers you can trust.
