Benchmark Performance

Clinical Validation Through Rigorous Benchmarking

August is evaluated against the same standardized tests used to train and license doctors. These aren't arbitrary metrics; they're the assessments that determine whether someone is qualified to practice medicine. When you ask August about your health, you deserve answers backed by the same rigor that qualifies your doctor.

Medical Knowledge

August delivers clinical responses validated against the same rigorous standards used in physician training and medical licensure, ensuring professional-grade accuracy in healthcare information delivery.

[Figure: Medical Knowledge benchmark comparing August with GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-4o on MedQA and USMLE]
97%: Physician-level diagnostic accuracy with immediate availability

August achieves 97% accuracy on MedQA, a benchmark of 12,000+ authentic medical licensing questions, outperforming general-purpose models through its specialized clinical architecture.

MOST ACCURATE: MedQA Accuracy 97% (+1.5 pp ahead of GPT-5)
PERFECT: USMLE Score 100% (the only AI to achieve this)

Clinical Significance: These assessments evaluate the diagnostic reasoning capabilities essential for accurate symptom analysis and clinical decision support.

USMLE Performance

The United States Medical Licensing Examination is the test every doctor must pass before practicing medicine in the United States. It's the gold standard for medical knowledge.

[Figure: USMLE benchmark showing August at 100%, compared with GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, GPT-4o, and August (2023)]
100%: Perfect performance on the physician qualification assessment

August achieves 100% accuracy on the USMLE, compared with a roughly 60% first-attempt pass rate among human medical graduates, demonstrating command of the diagnosis, treatment, and medical decision-making that licensure requires.
System               USMLE Score   August Advantage
August               100%          baseline
Claude 4.5 Sonnet    97.2%         +2.8 pp
GPT-5                97.0%         +3.0 pp
GPT-4o               92.3%         +7.7 pp
Gemini 2.5 Pro       90.1%         +9.9 pp
The USMLE evaluates clinical decision-making capabilities by assessing diagnostic reasoning based on patient presentations and medical history. August's perfect score demonstrates consistent accuracy in the clinical reasoning scenarios users present daily.
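The "August Advantage" column is simple arithmetic: August's score minus each competitor's, expressed in percentage points. A quick sketch of that computation from the table's figures:

```python
# Percentage-point advantage: August's USMLE score minus each
# competitor's score, using the values reported in the table above.
august = 100.0
competitors = {
    "Claude 4.5 Sonnet": 97.2,
    "GPT-5": 97.0,
    "GPT-4o": 92.3,
    "Gemini 2.5 Pro": 90.1,
}
advantages = {name: round(august - score, 1) for name, score in competitors.items()}
print(advantages)
# {'Claude 4.5 Sonnet': 2.8, 'GPT-5': 3.0, 'GPT-4o': 7.7, 'Gemini 2.5 Pro': 9.9}
```

Note that a percentage-point (pp) gap is an absolute difference between two percentages, not a relative change.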

MedQA Benchmark

A collection of 12,000+ real medical board exam questions from the US, China, and Taiwan, covering everything from rare diseases to common conditions, diagnostic reasoning to treatment plans.

[Figure: MedQA benchmark showing August leading with 97.36% accuracy]
97.4%: Validated on authentic board examination questions

August outperforms GPT-5 by 1.5 pp and Gemini 2.5 Pro by 4.2 pp, and has been validated against thousands of real-world medical scenarios: the same questions doctors answer to become licensed. August doesn't guess.

TOP PERFORMER: August Score 97.4% (+1.5 pp vs GPT-5 at 95.8%)
Questions Tested: 12k+, with consistent accuracy across all questions
Clinical Relevance: MedQA questions require multi-step clinical reasoning, integrating symptom analysis, differential diagnosis, and treatment planning, and so reflect the complexity of actual clinical inquiries.

MMLU Clinical Benchmarks

A comprehensive test across six medical specialties: anatomy, genetics, clinical knowledge, professional medicine, college biology, and college medicine.

[Figure: MMLU Clinical benchmark breakdown across Anatomy, Clinical Knowledge, College Biology, College Medicine, Medical Genetics, and Professional Medicine]
94–100%: Consistent accuracy across medical specialties

August maintains roughly 94–100% accuracy across all clinical categories, because your health questions don't fit into neat boxes: whether you ask about your thyroid, your genes, or your child's development, you get the same level of reliability.
Category                August   GPT-5    Result    Why It Matters
College Medicine        95.1%    91.9%    +3.2 pp   Treatment decisions
Clinical Knowledge      96.4%    95.1%    +1.4 pp   Diagnostic reasoning
Anatomy                 93.6%    92.6%    +1.0 pp   Understanding symptoms
Professional Medicine   98.1%    97.8%    +0.3 pp   Clinical practice
College Biology         99.3%    99.3%    TIE       Foundational knowledge
Medical Genetics        100%     100%     PERFECT   Hereditary risk
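The "94–100%" headline summarizes the spread of August's per-category scores above; the low end (Anatomy, 93.6%) rounds to 94%. A quick arithmetic check of the range and the unweighted mean across the six categories:

```python
# August's six MMLU clinical-category scores, taken from the table above.
scores = {
    "College Medicine": 95.1,
    "Clinical Knowledge": 96.4,
    "Anatomy": 93.6,
    "Professional Medicine": 98.1,
    "College Biology": 99.3,
    "Medical Genetics": 100.0,
}
mean = sum(scores.values()) / len(scores)     # unweighted mean across categories
low, high = min(scores.values()), max(scores.values())
print(f"mean={mean:.1f}%  range={low}-{high}%")  # mean=97.1%  range=93.6-100.0%
```

The unweighted mean ignores how many questions each category contains; a per-question average could differ slightly.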

Conversational Diagnostics

Most benchmarks test whether an AI can select the right answer from a list, rather than converse with a patient, ask the right questions, and reach a diagnosis the way a doctor does. We developed an in-house, peer-reviewed methodology that simulates real clinical conversations, evaluated across 400 clinical vignettes spanning 14 medical specialties.

[Figure: Conversational Diagnostics benchmark showing August at 87% diagnostic accuracy and 97% triage accuracy across 400 clinical vignettes]
87%: Highest conversational diagnostic accuracy of any AI system

August achieves 87% diagnostic accuracy in multi-turn clinical conversations (+21 pp over GPT-5) and 97% triage accuracy, ensuring patients are routed to the right level of care. This isn't a multiple-choice test: it's diagnosis through conversation, evaluated with our in-house methodology.
System               Diagnostic Accuracy   August Advantage
August               87%                   baseline
Claude 4.5 Sonnet    75%                   +12 pp
Gemini 2.5 Pro       69%                   +18 pp
GPT-5                66%                   +21 pp
GPT-4o               55.3%                 +31.7 pp
System               Triage Accuracy   August Advantage
August               97%               baseline
Claude 4.5 Sonnet    92.5%             +4.5 pp
GPT-4o               91.3%             +5.7 pp
Gemini 2.5 Pro       88.8%             +8.2 pp
GPT-5                88.3%             +8.7 pp
Unlike static multiple-choice benchmarks, this evaluation uses multi-turn conversations in which the AI must gather information by asking questions, just as in a real clinical encounter. August reaches the correct diagnosis in about 45% fewer questions than competitors (16 vs 29 on average), demonstrating both accuracy and efficiency. The methodology has been peer-reviewed and published at arXiv:2412.12538.
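A multi-turn evaluation of this kind can be sketched roughly as follows. This is an illustrative outline only: the `run_encounter` loop, the `Vignette` structure, and the toy `ask`/`diagnose` functions are hypothetical and are not August's actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Vignette:
    presenting_complaint: str   # what the simulated patient says first
    facts: dict                 # question -> answer the simulated patient can give
    diagnosis: str              # ground-truth diagnosis used for scoring

def run_encounter(model_ask, model_diagnose, vignette, max_turns=30):
    """Simulate one clinical conversation; return (correct?, questions asked)."""
    history = [("patient", vignette.presenting_complaint)]
    asked = 0
    for _ in range(max_turns):
        question = model_ask(history)
        if question is None:                      # model is ready to diagnose
            break
        asked += 1
        answer = vignette.facts.get(question, "not sure")
        history += [("doctor", question), ("patient", answer)]
    guess = model_diagnose(history)
    return guess == vignette.diagnosis, asked

# Toy illustration: a one-question "model" on a single made-up vignette.
v = Vignette("I have a sore throat.", {"Do you have a fever?": "yes"}, "flu")
ask = lambda h: "Do you have a fever?" if len(h) == 1 else None
diagnose = lambda h: "flu" if ("patient", "yes") in h else "cold"
print(run_encounter(ask, diagnose, v))  # (True, 1)
```

Scoring both correctness and the number of questions asked is what lets a benchmark like this report efficiency (16 vs 29 questions) alongside accuracy.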

Document Processing

Processing and interpreting clinical documentation, including laboratory reports, prescriptions, and discharge summaries, across both printed and handwritten medical documents. Evaluated using our in-house benchmark methodology.

[Figure: Document Processing benchmark comparing Lab Report and Handwritten Prescription accuracy across models]
82%: Advanced handwritten medical document interpretation

On our in-house benchmark, August achieves 82% accuracy on handwritten prescriptions (versus a 49% competitor average) and 99% accuracy on laboratory reports. August translates medical jargon into plain language and helps you understand what your doctor is recommending.

KEY DIFFERENTIATOR: Handwritten Rx 82% (+21.2 pp vs next best, Gemini at 60.8%)
Lab Reports: 99.4%, with accessible interpretation of results for patient comprehension

Clinical Significance: Accurate interpretation of medical documentation reduces medication errors and improves patient understanding of clinical information.

Safety By Design

The ability to correctly identify when symptoms need immediate medical attention versus when they can wait for a regular appointment, measured using our in-house emergency escalation benchmark.

[Figure: Emergency Escalation Precision benchmark showing August at 100%, compared with other models averaging 28%]
100%: Perfect sensitivity and specificity in emergency identification

On our in-house emergency escalation benchmark, August achieves 100% recall and 100% precision in emergency identification, while other LLMs average 28% precision: roughly 7 out of 10 times they flag an emergency, it isn't one. August gets it right 10 out of 10 times.
SAFETY: Recall 100% (every true emergency flagged, none missed)
ZERO FALSE POSITIVES: Precision 100% (competitors average ~28%)
Test Results (N=138)
TP: 41 emergencies correctly identified
TN: 97 non-emergencies correctly handled
FP: 0 false alarms
FN: 0 missed emergencies
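The headline metrics follow directly from the confusion matrix above. A quick check of the standard definitions against the reported counts:

```python
# Confusion-matrix arithmetic for the N=138 emergency-escalation test,
# using the TP/TN/FP/FN counts reported above.
tp, tn, fp, fn = 41, 97, 0, 0
assert tp + tn + fp + fn == 138

precision   = tp / (tp + fp)   # fraction of flagged emergencies that were real
recall      = tp / (tp + fn)   # fraction of real emergencies that were flagged
specificity = tn / (tn + fp)   # fraction of non-emergencies correctly not flagged
print(precision, recall, specificity)  # 1.0 1.0 1.0

# At 28% precision, a model's emergency flags are wrong 72% of the time,
# i.e. roughly 7 of every 10 flags are not actually emergencies.
false_alarm_rate = 1 - 0.28
print(round(false_alarm_rate, 2))  # 0.72
```

With zero false positives and zero false negatives, precision, recall, and specificity all equal 1.0 by definition.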

Talk to August

Need medical answers right now?

For instant 24/7 medical guidance, reach out to August.

Talk to August