LLMs Are Easy To Trick. In Medicine, That Can Be Deadly.
A famous drug disaster is back as a test case, and it exposes a deeper problem with how we are deploying generative AI in health
Picture a worried patient, late at night, copy pasting a paragraph from a forum into a health chatbot and asking if a drug is safe in pregnancy. Now imagine the chatbot has quietly swallowed a booby-trapped instruction inside that paragraph, an instruction that overrides its guardrails and flips safety on its head. The answer looks confident, even kind. It is also wrong.
That, in essence, is the fear raised by new work flagged in a JAMA Network Open article and ricocheted across Reddit. Researchers showed that large language models can be prompt injected, coerced to produce harmful medical guidance, including endorsing thalidomide for pregnant people. The choice of example was not accidental. The very name thalidomide is a siren, a reminder that medical advice is not just another content category.
Prompt injection, listed as LLM01 in the OWASP Top 10 for LLM Applications, is not a parlor trick, it is a predictable failure mode of systems that treat model instructions as data.
The thalidomide memory is doing work here
In the late 1950s and early 1960s, thalidomide was marketed as a sedative and antiemetic. It caused catastrophic birth defects in thousands of babies worldwide. The United States was largely spared because FDA reviewer Frances Kelsey refused to approve it, a decision that became a case study in regulatory vigilance. The agency still uses it to teach caution, documenting how more than 10,000 infants were affected and about 40 percent died shortly after birth, according to its historical account here.
Invoking thalidomide is not a rhetorical flourish. It is a stress test. If a chatbot can be manipulated to endorse perhaps the most infamous contraindication in modern medicine, what does that say about the maturity of our safety mechanisms as these systems creep into triage, patient education and even clinical decision support?
No LLM should be allowed to recommend or contraindicate drugs without independent verification, least privilege access and auditable guardrails.
From clever jailbreaks to real clinical risk
Engineers often shrug at screenshots of jailbroken models. Many are synthetic, carefully engineered to trick systems that have no tools, no retrieval and no clinical integrations. That critique misses the point. The dangerous version of prompt injection is indirect. It arrives hidden in a web page, a PDF discharge summary, a pasted lab report, even a user’s copy pasted symptom checklist. When a model is told to summarize or answer questions using that content, the poison rides along.
Security researchers have been warning about this pattern for more than a year. Microsoft’s guidance on prompt injection and indirect prompt injection reads like a threat model for health bots that browse or ingest patient documents. The OWASP list puts it at the very top for a reason. And the academic community has shown how simple adversarial strings can reliably break safeguards across systems, as in Carnegie Mellon’s work on transferable jailbreaks here.
Meanwhile, in medicine, the baseline is already fraught. A 2023 study in JAMA Internal Medicine found that clinicians preferred chatbot responses for empathy and quality when answering patient questions, but that was precisely the concern. People are prone to trust confident, fluent answers. In safety critical contexts, that combination can amplify automation bias, the human tendency to over-rely on a tool that looks competent.
It does not help that deployment is racing ahead. Tech companies are piloting doctor-facing assistants and patient chatbots. Google’s medical models, including the Med-PaLM line, have shown promise but also documented safety failures in clinician evaluations, which the company has discussed on its research blog here. The World Health Organization has urged caution in its guidance on AI for health, emphasizing governance, transparency and a clear chain of accountability, see ethics and governance and regulatory considerations.
Disclaimers are not a safety strategy. Architecture is.
What safer looks like, and what to watch
The fix is not a better splash screen. It is treating the model like an untrusted interpreter in a high risk system, then building the rest accordingly.
- Segment the problem. Use small, purpose built components for drug lookups, contraindications and triage rather than one general model for everything. Put a rule based or verified knowledge layer in front of any output that can influence care.
- Lock down inputs, verify outputs. Sanitize and isolate external content, including patient uploaded files and web pages. Strip or neutralize instructions in retrieved text. Pass candidate answers through validators that check against authoritative sources, such as curated formularies or guidelines.
- Limit what the model can do. Apply least privilege to tools. If the chatbot does not need to browse, do not give it a browser. If it can browse, constrain to a whitelisted corpus.
- Make deception part of testing. Red team with realistic indirect injections, not just meme jailbreaks. Track attack success rates as a key performance indicator. Use community resources like the OWASP LLM Top 10 to seed scenarios.
- Design the human in. Put a clinician in the loop for anything high risk. Force uncertainty displays, citations and links so users can verify. Never present outputs as final medical advice.
Regulators are circling. The European Union’s AI Act classifies many health AI systems as high risk, which drags them into conformity assessments and post market monitoring. In the United States, the Food and Drug Administration has outlined Good Machine Learning Practice and is grappling with adaptive algorithms in software as a medical device. NIST’s AI Risk Management Framework gives health systems a common language for mapping, measuring and managing these hazards.
There are fair counterpoints. Many of the most alarming demonstrations are crafted in the lab. In a carefully fenced deployment with retrieval to a vetted corpus and output filters, the odds of a catastrophic hallucination are lower. But indirect prompt injection exploits the same open world property that makes LLMs useful. If your assistant touches the open web, or even untrusted PDFs, you inherit that risk. And because health is unforgiving, the threshold for acceptable residual risk is smaller than in a shopping bot or a travel planner.
So where does that leave the thalidomide test? It tells us the right metric is not average helpfulness, it is resilience under adversarial pressure for a small set of never events, for example, contraindicated drug recommendations in pregnancy. Providers should be asking their vendors to demonstrate that resilience and to publish red team results. Purchasers should ask whether these systems can be forced to violate their own do not say lists by poisoned inputs, and if so, how often and under what controls.
Finally, a cultural point. Health care has long experience managing powerful but fallible tools. We already run checklists, double checks and alert overrides. The lesson from thalidomide, and now from prompt injection, is that vigilance must be institutional. Do not ask a probabilistic model to police itself. Build procedures around it that assume it will fail, then make those failures boring, contained and reversible.
