Safety is probably the most important factor in healthcare. An AI assistant or agent that the user can't trust to be 100% safe is genuinely dangerous.
It's something we've been very conscious of from the beginning at August AI. A person's health should never be taken lightly, and over the years we've continuously improved August's performance on safety and accuracy.
But saying that isn't enough; we need an objective measurement.
There aren't many good public benchmarks for testing AI capabilities in healthcare, and even fewer that can be used to demonstrate safety specifically.
The best option is HealthBench, which OpenAI launched in May last year. It's a dataset of 5,000 health conversations that we can test AI assistants against. It has its limitations, which we'll get to in a little bit. We focused specifically on a subset called HealthBench Consensus, and looked at 138 conversations that involved emergency escalations.
August scored a perfect 1.00 on both recall (escalating every true emergency) and precision (never escalating a non-emergency).
In comparison, general-purpose AI like ChatGPT and Gemini also escalate every emergency, but their precision is far worse, as shown in the chart below.

What the data shows is that general AI assistants are extremely cautious, which is a good starting point. But they also escalate a lot of non-emergencies, which wastes clinician time and makes for a much worse user experience.
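To make that trade-off concrete, here's a small illustrative sketch. The labels below are made up for the example, not HealthBench data; it simply shows why an assistant that escalates everything gets perfect recall but poor precision.

```python
# Illustrative only: hypothetical labels, not real HealthBench data.
# True means "this conversation is a genuine emergency".
is_emergency = [True, False, False, True, False]

# A maximally cautious assistant escalates every single conversation.
always_escalate = [True] * len(is_emergency)

def recall_and_precision(escalated, actual):
    tp = sum(e and a for e, a in zip(escalated, actual))        # emergencies escalated
    fp = sum(e and not a for e, a in zip(escalated, actual))    # non-emergencies escalated
    fn = sum((not e) and a for e, a in zip(escalated, actual))  # emergencies missed
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return recall, precision

print(recall_and_precision(always_escalate, is_emergency))
# (1.0, 0.4): every emergency is caught, but most escalations were unnecessary.
```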
We ran into this about two and a half years ago. It's very easy to just say "go see a doctor" in response to every user query. But to build a health AI that's actually usable and helpful, we needed to get it right every time, not just play it safe.
Our advantage is that we've had millions of user messages and conversations over years that are specifically about health. We've seen every single edge case and failure mode.
So we've built guardrails at every level, from the system prompt to sanitizing outputs, while relentlessly focusing on precision and accuracy for all health queries. And we're not satisfied yet.
As we mentioned earlier, there are limitations to existing benchmarks, both the public ones and the ones we've built for internal use.
The real world is hard, and you can never guarantee a perfect result, even with the best doctor or healthcare team. It's a fundamental truth that the medical profession faces every day.
So when we see that August is getting really good at a set of evals and benchmarks that we have, we shift the goalposts. We find new ways to make it more challenging and have the AI struggle again, which helps us figure out where we can do even better.
Over the course of this year, we're planning to run more public benchmarks. We decided to start with emergency scenarios in HealthBench since those are the most safety-critical situations that a user might face. But as we go along, we're going to cover all kinds of test cases, with a focus on messy real-world conversations with patients.
When perfection is impossible, a perfect score just means we need harder tests.
We modelled our emergency safety testing on Counsel AI's triage assessment for AI systems, which is based on OpenAI's HealthBench dataset.
Specifically, it looks at the HealthBench Consensus subset, which comprises a little over 3,600 scenarios where at least two doctors were in agreement.
From that set, 453 conversations categorized by physicians as emergency-related were extracted, which were then narrowed down to a final set of 138 emergency scenarios for testing.
We gave those to August one at a time and assessed each response to see whether it identified the scenario as needing an emergency escalation or not.
We then compared August's responses (escalation vs no escalation) to the consensus physician rubrics in HealthBench for those 138 scenarios. A score of 1.00 indicates a perfect match.
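The scoring itself reduces to a straightforward comparison. Here's a minimal sketch of the idea in Python; the field names and the model_escalates stub are our own placeholders for illustration, not the actual HealthBench schema or our production test harness.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Placeholder fields: the real entries carry the full conversation
    # plus the physician-written consensus rubric.
    conversation: str
    physicians_say_emergency: bool  # consensus label: does this need escalation?

def model_escalates(conversation: str) -> bool:
    """Send the conversation to the assistant under test and decide whether
    its reply recommends emergency care. Stubbed out here."""
    raise NotImplementedError

def score(scenarios: list[Scenario]) -> tuple[float, float]:
    tp = fp = fn = 0
    for s in scenarios:
        escalated = model_escalates(s.conversation)
        if escalated and s.physicians_say_emergency:
            tp += 1   # correctly escalated an emergency
        elif escalated:
            fp += 1   # escalated a non-emergency
        elif s.physicians_say_emergency:
            fn += 1   # missed an emergency
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return recall, precision  # 1.00 on both means every decision matched the consensus
```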
All testing was conducted on the public version of August.