We Asked GPT If GPT Was a Good Doctor. It Said Yes.
On AI, mimicry, and the growing danger of evaluation theatre in healthcare.
A recent paper titled “Toward the Autonomous AI Doctor” caught my attention.
Its claim is as bold as it is seductive: that GPT-4, given a single prompt, can diagnose and treat patients in real-world telehealth settings with greater safety and fidelity than human physicians.
The results are dazzling. 81% top-1 diagnostic accuracy. 99.2% alignment with human treatment plans. Zero hallucinations. Zero unsafe decisions. And in cases where the AI and human doctors disagreed, GPT-4 was “superior” over a third of the time.
If you believe the paper, the future of clinical autonomy is here. AI does not just assist. It replaces.
But if you read closely, another story emerges - one not of breakthrough, but of illusion.
And to fully understand the illusion, we need to pair this paper with another: “Stop Evaluating AI with Human Tests,” a quiet but devastating critique of how we assess machine intelligence in human domains.
Together, they reveal the same pattern that is increasingly infecting AI-in-healthcare discourse:
Surface-level fluency mistaken for deep understanding.
And benchmark theatre mistaken for safety.
The experiment that flatters itself
The authors of “Toward the Autonomous AI Doctor” describe a study involving 500 real telehealth consultations. GPT-4 was given a structured patient intake - symptoms, history, demographics - and asked to write a full SOAP note (subjective, objective, assessment, plan). Human clinicians did the same.
Then, an adjudication process was used to compare the notes, assessing whether the diagnoses matched, whether treatment plans were aligned, and which plan was “better” in the case of disagreement.
But here is the catch: the adjudicator was also GPT-4.
GPT-4 generated the diagnosis. Then GPT-4, prompted in a separate session but trained on the same data and guided by rubrics crafted by human experts, judged how close its own output was to the human clinician’s - and often declared itself correct.
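To make the circularity concrete, here is a minimal sketch of that loop in Python - not the authors' code, just the shape of the pipeline as the paper describes it. The call_model function is a hypothetical placeholder standing in for any GPT-4-style completion call, not a real API:

```python
# Illustrative sketch only - NOT the paper's implementation.
# The point is the shape of the loop: the generator and the judge
# are drawn from the same model family.

def call_model(prompt: str) -> str:
    """Placeholder for a single LLM completion (e.g. GPT-4)."""
    raise NotImplementedError("stand-in for a real model call")

def ai_soap_note(intake: dict) -> str:
    """Step 1: the model writes the SOAP note from structured intake."""
    return call_model(f"Write a SOAP note for this patient intake:\n{intake}")

def adjudicate(ai_note: str, clinician_note: str, rubric: str) -> str:
    """Step 2: the *same* model judges whether the two notes agree."""
    return call_model(
        f"Rubric:\n{rubric}\n\nAI note:\n{ai_note}\n\n"
        f"Clinician note:\n{clinician_note}\n\n"
        "Do the diagnoses match? Which plan is better? Is the AI plan safe?"
    )

def evaluate(cases, rubric):
    """The loop that produces the headline numbers: 'alignment' is
    measured entirely inside one generative paradigm."""
    verdicts = []
    for intake, clinician_note in cases:
        ai_note = ai_soap_note(intake)
        verdicts.append(adjudicate(ai_note, clinician_note, rubric))
    return verdicts
```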
Even when the authors ran human adjudication, the clinicians had access to the AI's note before writing their own - an obvious anchor that risks inflating alignment. The study mentions this, then breezes past it.
There is no patient follow-up. No actual health outcomes measured. No longitudinal data. Just a comparison of documents. One doctor note versus another, judged by a model trained on... doctor notes.
What we get is not a clinical evaluation.
It is a simulation of clinical plausibility - optimised for documentation, not reality.
The danger of concordance
In the world of AI evaluation, the most seductive metric is concordance. It asks: “Does AI say the same thing as a human expert would?”
This is easy to calculate, easy to plot, and easy to publish. When the numbers are high, it feels like validation. But concordance is epistemically weak, because in complex, high-stakes domains like medicine, alignment with another fallible agent is no guarantee of correctness.
As the authors of “Stop Evaluating AI with Human Tests” put it:
“A system that gets the right answer for the wrong reason is not a system you can trust.”
And that is the central problem here. GPT-4 may produce outputs that look like clinical reasoning. But its process is alien. It does not weigh differential diagnoses. It does not reason causally. It does not assess risk. It sequences tokens - statistically, not semantically.
So when it matches a human diagnosis, that does not mean it understood. It just means it aligned - in language, not judgement.
If a doctor and a chatbot both say “sinusitis,” but one is triangulating based on physical exam, experience, comorbidity, and unspoken patient cues, while the other is predicting word sequences that follow “congested nose and headache,” we should not call them equivalent.
Concordance is not correctness.
It is mimicry measured in the mirror of our own documentation.
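A toy example makes the gap visible. The diagnoses below are invented for illustration - they are not study data - but they show how concordance is scored against another answer sheet while accuracy is scored against the world, so a shared miss simply disappears:

```python
# Toy illustration with invented labels (not study data).

def concordance(a, b):
    """Fraction of cases where two sources give the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def accuracy(preds, truth):
    """Fraction of cases where a source matches the confirmed outcome."""
    return sum(p == t for p, t in zip(preds, truth)) / len(preds)

ai_dx        = ["sinusitis", "migraine", "GERD", "viral URI"]
clinician_dx = ["sinusitis", "migraine", "GERD", "viral URI"]
confirmed    = ["sinusitis", "migraine", "GERD", "bacterial pneumonia"]

print(concordance(ai_dx, clinician_dx))   # 1.0  - perfect 'alignment'
print(accuracy(ai_dx, confirmed))         # 0.75 - the shared miss is invisible
```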
The simulation fallacy
This is where “Stop Evaluating AI with Human Tests” is so valuable. It gives us the language to name what is happening: category error.
“There is a category error in evaluating an alien intelligence using tests designed for humans. The results may look familiar - but the reasoning behind them is anything but.”
When we evaluate LLMs using human-style multiple-choice questions, SOAP notes, or language-aligned benchmarks, we are not assessing reasoning. We are assessing performance on test formats that were originally designed to probe human cognitive processes - not predictive language models.
What we end up with is simulation-grade confidence.
We ask AI to sound right, and then reward it for sounding right.
We judge its answers by their similarity to answers we already gave.
And we declare autonomy - because the form has been faithfully reproduced.
But this is not medicine.
It is evaluation theatre.
A mirror that flatters
Perhaps the most damning aspect of “Toward the Autonomous AI Doctor” is not what it gets wrong, but what it claims to have gotten right.
It states - flatly - that in 97 of 500 cases, the AI and clinician disagreed. And in 36% of those cases, AI was “superior.” This sounds like a powerful stat - until you do the arithmetic: 36% of 97 is roughly 35 cases, which means GPT-4 outperformed in only about 7% of the full dataset.
What about the rest?
Well, 99.2% were deemed “safe.”
But again, not by patients. Not by longitudinal data. Not by a blinded panel of diverse clinicians.
By GPT-4 itself.
The entire loop is closed - prompt, plan, judgement - inside a single generative paradigm. A model asked if it agrees with itself. And surprise: it does.
Why this matters
In many ways, this paper is the perfect object lesson in why “human-style” evaluation leads us astray.
It reveals how we can build systems that look excellent by every formal metric, and yet be utterly unfit for real-world autonomy.
Because healthcare is not a language task.
It is a human one.
It operates under uncertainty, emotion, history, risk, bias, moral ambiguity, and interpersonal context. It demands explanation, not just prediction. Responsibility, not just alignment.
GPT-4 can summarise a guideline.
But it cannot know why that guideline matters.
It can produce a SOAP note.
But it cannot choose between two competing hypotheses when both look plausible.
It cannot handle a patient who cries, hesitates, or contradicts themselves.
It cannot explain to a family why nothing more can be done.
These are not bugs.
They are reminders of what intelligence means in the human domain.
A call for better evaluation
None of this is a call to reject AI in healthcare. Quite the opposite.
LLMs will absolutely transform the field - in documentation, summarisation, triage, decision support, and more.
But if we continue to evaluate these systems using tests designed to reward linguistic mimicry, we will be seduced by the illusion of safety.
We need new benchmarks. Not just for correctness, but for robustness. Generalisation. Causal reasoning. Explanation. Counterfactual judgement. And above all, the ability to know what the model does not know.
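As one concrete direction for that last point, here is a hedged sketch of a selective-prediction score, in which abstaining is free but a confident wrong answer is penalised. The Case structure and the penalty weight are illustrative assumptions, not a proposal from either paper:

```python
# Sketch of an abstention-aware score: bluffing is punished, not rewarded.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Case:
    prediction: Optional[str]   # None means the model abstained ("I don't know")
    truth: str

def selective_score(cases, wrong_penalty: float = 2.0) -> float:
    """Reward correct answers, ignore abstentions, penalise confident errors."""
    score = 0.0
    for c in cases:
        if c.prediction is None:
            continue                  # saying "I don't know" costs nothing
        score += 1.0 if c.prediction == c.truth else -wrong_penalty
    return score / len(cases)

cases = [Case("sinusitis", "sinusitis"), Case(None, "pneumonia"), Case("GERD", "MI")]
print(selective_score(cases))   # (1.0 + 0.0 - 2.0) / 3 ≈ -0.33: guessing hurts
```

Under a metric like this, a model can only look good by being right or by admitting uncertainty - not by sounding right.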
Until then, what we are building is not an autonomous doctor.
It is a mirror - polished and fluent - that reflects our best documentation back at us and tells us what we want to hear.
Stuart x
As you note, these results do not mean what the headlines would suggest.
I want to talk about one of your points a bit more. Affirming the consequent is a logical fallacy. Comparing the performance of any model against a benchmark does not provide evidence about the means by which that performance was achieved. The cookies being missing from the cookie jar does not mean that Joey took them. Playing championship chess does not mean that the computer has achieved artificial general intelligence. Seeing a shape that resembles a face on Mars does not mean that that face was drawn by Martians.
As you said: “A system that gets the right answer for the wrong reason is not a system you can trust.”
I believe that AGI is achievable, but language modeling is not sufficient to achieve it. It is language modeling, not intelligence modeling. For example, I could learn to recite a speech in a foreign language that I do not understand. That speech may, in fact, be brilliant, but that does not indicate that I am brilliant. Simulating intelligence is easier than implementing intelligence.