Evaluating risk when using AI in healthcare – the TRUST framework

ChatGPT has created a surge of interest in the medical community, certainly if you measure by the number of publications about it. There are many potential use cases for these impressive large language models (LLMs): drafting messages to patients, giving general advice, providing medical education, and more. But for every success, like passing the medical licensing exam or giving better responses to patient messages than physicians do, there are failures, like flunking the gastroenterology exam and fabricating medical citations to justify its answers.

So, when will AI like ChatGPT be safe to use in medicine? A more precise way to ask that question is: when will the risks of using a particular AI model in a particular scenario be acceptable to your particular organization, given the potential benefits? Here we focus on the risk side of that equation: what framework can you use to evaluate the risks and decide whether they’re acceptable? We call it the TRUST framework: Transparent, Reviewable, Understandable, Secure, Testable. It applies broadly to using AI in medicine, not just to GPTs.

Let’s start with four certainties:

  • AI is most accurate when it has lots of examples. It can be downright inaccurate if it has few (class imbalance / long tail data distribution).
  • If the instances being asked about don’t fit well within the AI’s training, the results can go wrong (domain shift).
  • Real people are often curating inputs or results. Their biases get incorporated into the model (cognitive bias).
  • Bad training data means bad results — AI is no exception to the garbage-in-garbage-out rule.

AI use in medicine must be weighed against these certainties. Unfortunately, there are far too many examples of AI bias to choose from, spanning race, gender, age, and socioeconomic status, among other factors. A model for detecting skin cancer was thought to be highly accurate, but was later found to be less than half as accurate for people of color because it was trained on datasets of predominantly fair-skinned patients. In cardiology, coronary heart disease (CHD) is overwhelmingly misdiagnosed in women, yet prediction models are trained on predominantly male datasets. A sleep-scoring model seemed to work well, but failed badly at detecting sleep disorders in older patients because there weren’t enough of them in the training dataset. A model for asthma management in children was found to be far less accurate for those of lower socioeconomic status, largely due to incomplete EHR source data.
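Failures like these are detectable before deployment if performance is evaluated stratified by subgroup rather than as a single aggregate number. Here is a minimal sketch of that idea, assuming scikit-learn and pandas; the subgroup column, labels, and model are hypothetical placeholders, not a reference to any particular product.

```python
# Minimal sketch: stratified evaluation to surface subgroup performance gaps.
# Assumes scikit-learn and pandas; column names ("skin_tone", "label") are
# hypothetical placeholders for whatever subgroup attributes you record.
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

def stratified_report(df: pd.DataFrame, y_pred, group_col: str) -> pd.DataFrame:
    """Report accuracy and sensitivity per subgroup instead of one aggregate number."""
    df = df.assign(pred=y_pred)
    rows = []
    for group, part in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(part),  # a small n is itself a warning sign (long-tail data)
            "accuracy": accuracy_score(part["label"], part["pred"]),
            "sensitivity": recall_score(part["label"], part["pred"], zero_division=0),
        })
    return pd.DataFrame(rows)

# Usage sketch: a strong aggregate accuracy can hide a much lower score in one subgroup.
# report = stratified_report(test_df, model.predict(X_test), group_col="skin_tone")
# print(report.sort_values("accuracy"))
```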

Another key risk area for AI in medicine is context. Consider the challenges of using AI trained on a specific large corpus of medical knowledge: clinical guidelines. The question for a healthcare provider is whether the model matches your specific context:

  • Is the AI model trained on the same guidelines you use: from your country, from the appropriate specialty society or source, targeted at your mix of skills and equipment, incorporating your most recent advances, and for your patient population? If not, the model may not be the right one.
  • How recent are the guidelines, and have they changed since the original model training? If they’re out of date, the model will be out of date as well.
  • Was the model trained using EHR data as a proxy for the guidelines? EHRs are notoriously full of errors, and only ~30-50% of physicians use the most recent clinical guidelines. Thus, the model may contain errors.

In each of these cases, the model may not prioritize the right information, display the right data, or suggest the right course of action for your context — all leading to avoidable errors. It’s emblematic of the broader risks created by AI.
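One way to make this context check concrete is to capture the model’s provenance and your site’s context as structured metadata and compare them before deployment, and again whenever guidelines change. The sketch below is illustrative only; the field names and the one-year staleness threshold are assumptions, not a standard.

```python
# Illustrative sketch of a pre-deployment context check, assuming you can
# obtain (or demand) this metadata from the model vendor. Field names are
# hypothetical, not part of any standard.
from dataclasses import dataclass
from datetime import date

@dataclass
class ModelProvenance:
    guideline_country: str
    guideline_source: str        # issuing specialty society or body
    guideline_date: date         # edition the model was trained on
    trained_on_ehr_proxy: bool   # EHR data used as a stand-in for guidelines

@dataclass
class SiteContext:
    country: str
    guideline_source: str
    current_guideline_date: date

def context_mismatches(model: ModelProvenance, site: SiteContext,
                       max_staleness_days: int = 365) -> list[str]:
    """Return human-readable reasons the model may not fit this site."""
    issues = []
    if model.guideline_country != site.country:
        issues.append("guidelines are from a different country")
    if model.guideline_source != site.guideline_source:
        issues.append("guidelines come from a different source or society")
    lag = (site.current_guideline_date - model.guideline_date).days
    if lag > max_staleness_days:
        issues.append(f"model guidelines trail the current edition by {lag} days")
    if model.trained_on_ehr_proxy:
        issues.append("EHR data used as a proxy for guidelines; expect propagated errors")
    return issues
```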

Now let’s look at how the TRUST framework can help to understand AI risks.

Transparent

Transparency is about seeing inside the black box of AI. Visibility into the training data, methods, and curators lets you gauge potential bias, patient-population mismatch, the likelihood of misdiagnosis, reliance on outdated information, and more (the list goes on, and on). Without transparency, AI risk is higher, and certainly more difficult to assess.

From a transparency perspective, insight into the exact composition of the training corpus is essential to evaluating risk. Yet LLMs, trained on vast bodies of publications, are particularly inscrutable in this regard. Even if an AI vendor provides full training data “transparency,” it may not be practical to dig into the training datasets themselves. It may be more practical to require transparency into the characteristics of the training data (e.g., the specific population), the curators (e.g., their demographics), and the methodologies (if any) used to assess and mitigate bias and other errors. There are tools designed specifically to support transparent reporting (like TRIPOD), to assess the risk of bias (like PROBAST), and still others to mitigate ML bias.
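In practice, this can take the form of a structured disclosure requested from the vendor, loosely in the spirit of a model card. The sketch below shows one hypothetical shape such a disclosure might take; the field names are illustrative assumptions, not the TRIPOD or PROBAST schemas.

```python
# Hypothetical sketch of a minimal transparency disclosure to request from a
# vendor, loosely in the spirit of a model card. Field names are illustrative
# assumptions, not the TRIPOD or PROBAST schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TransparencyDisclosure:
    training_population: str            # e.g., age range, geography, care setting
    data_sources: list[str]             # registries, EHR extracts, literature corpora
    curator_description: str            # who labeled/curated and how they were selected
    bias_assessment_method: Optional[str]   # e.g., "PROBAST-style review", or None
    known_limitations: list[str] = field(default_factory=list)

    def risk_flags(self) -> list[str]:
        """Crude screening: undisclosed items raise, not lower, your risk estimate."""
        flags = []
        if self.bias_assessment_method is None:
            flags.append("no documented bias assessment")
        if not self.curator_description:
            flags.append("curators undisclosed")
        if not self.data_sources:
            flags.append("training sources undisclosed")
        return flags
```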

Recommendation: Require transparency into training data, methods, and curators. Assess if they match your full context and assume higher risk if they don’t. Require vendors to follow established methodologies to mitigate errors and bias.

Reviewable

While transparency helps to mitigate individual model risk, reviewability helps to mitigate more systemic risk. It’s most applicable to composite AI, which uses more than one model, in sequence or collectively, to reach a result. It’s also applicable to composite AI’s simpler brethren (AI that gathers data from an EHR), because both share the same underlying problem: can you review interim results to ensure that errors are intercepted and corrected so they don’t propagate and cause larger problems? In the complex world of healthcare, “No AI is an island.” The level of reviewability substantially affects the risk of using AI.

Here’s a personal anecdote to illustrate. My EHR documentation says I have heart tumors. Yes, heart tumors. Fortunately, I don’t. After a routine CT, an ML model mis-transcribed “unremarkable heart chambers” as “unremarkable heart tumors.” As I learned the hard way, the word “unremarkable” really doesn’t fit with “heart tumors,” so a subsequent AI model is quite likely to ignore the qualifier. Nor is my problem (a records error, not heart tumors) uncommon. Up to half of health records may contain an error, 16% of which may be serious. EHRs are full of errors, omissions, and conflicting data, among other problems. Any AI used to help determine my medications, treatments, risk evaluation, insurance, and so on would be detrimentally affected without the ability to intercept the error in-line and correct it for downstream use.
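Here’s a minimal sketch of the kind of reviewability that could intercept an error like that: each pipeline step records its interim result, and a reviewer-supplied correction supplants the bad value before downstream steps consume it. The step names, record IDs, and the plain dict used as a correction store are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of a reviewable pipeline: interim results are recorded and a
# correction store lets a reviewer override erroneous values before the next
# step consumes them. The steps and record IDs here are illustrative only.
from typing import Callable

def run_reviewable_pipeline(record_id: str,
                            raw_input: dict,
                            steps: list[tuple[str, Callable[[dict], dict]]],
                            corrections: dict[tuple[str, str], dict],
                            audit_log: list) -> dict:
    """Run named steps in order; apply human corrections keyed by (record, step)."""
    data = raw_input
    for step_name, step_fn in steps:
        data = step_fn(data)
        # Record the interim result so it can be reviewed later.
        audit_log.append((record_id, step_name, dict(data)))
        # A reviewer-supplied correction supplants the step's output downstream.
        override = corrections.get((record_id, step_name))
        if override is not None:
            data = {**data, **override}
            audit_log.append((record_id, step_name + ":corrected", dict(data)))
    return data

# Usage sketch: correct a transcription error before downstream models see it.
# corrections[("pt-123", "transcribe")] = {"finding": "unremarkable heart chambers"}
```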

Recommendation: Ensure there’s reviewability at each step in the data pipeline. Ensure that there’s a mechanism for updates, corrections, and consults to supplant erroneous data or results for “downstream” use.

Understandable

Using an AI model entails a lot more risk when you’re unable to understand how it reached a result. Understandability in AI is generally drawn from interpretability and explainability. Interpretability refers to models that are transparent in terms of how outcomes are generated — their “internal chain of thought” or reasoning, if you will. By contrast, explainability refers to creating a second model to explain the initial ML-based system results because the core ML-based system isn’t necessarily transparent (say, because it has millions of parameters). Having an interpretable model means that clinicians can understand and review how an outcome was generated. Consequently, interpretable models are better able to engender trust and less prone to error propagation. By contrast, explainable ML is often unreliable, can be misleading, and may fail to deliver clarity about function and objectives. Think of it as being rewarded for persuasiveness, not accuracy — obviously problematic in most of healthcare.
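To make the distinction concrete, here is a minimal sketch of an inherently interpretable model on tabular clinical data: a logistic regression whose coefficients a clinician can read directly and sanity-check against clinical knowledge. The feature names and toy data are hypothetical.

```python
# Minimal sketch of an inherently interpretable model: a logistic regression
# whose per-feature coefficients can be read and reviewed by a clinician.
# Feature names and data are hypothetical, purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["age", "systolic_bp", "ldl", "smoker"]  # illustrative features
X = np.array([[54, 130, 3.1, 1],
              [61, 145, 4.2, 0],
              [47, 118, 2.6, 0],
              [70, 160, 4.8, 1]], dtype=float)
y = np.array([0, 1, 0, 1])  # toy labels

model = LogisticRegression(max_iter=1000).fit(X, y)

# Each coefficient states how a unit change in that feature shifts the log-odds,
# which a reviewer can sanity-check rather than trust a post-hoc explanation.
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.3f} log-odds per unit")
```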

Recommendation: Focus on using inherently interpretable AI models. Ensure that, in a complex reasoning chain, all the interim results are visible, that the rationale for individual steps can be traced back to reputable sources, and that the full context is evident.

Secure

Healthcare has unique security and privacy concerns that impact AI risk. While HIPAA and traditional cybersecurity measures represent ground floor elements in secure AI, medical AI risk also derives from the complex interaction between training data and the training algorithm. If vulnerabilities are found, you can’t simply “patch and continue.” The model itself may need retraining.

AI extends privacy and security threat vectors into new realms. In some AI systems, even if the original data is deleted, model inversion attacks can reconstruct the original training data. ChatGPT is known to memorize training data that should be protected. If a healthcare AI model trained on patient information is “out in the wild,” private patient data can be exposed with the right attack, a gross violation of privacy. Patient context provided to an LLM in an ongoing conversation is also an issue: all the requisite patient information consumed to answer questions is transferred and stored for at least the duration of that conversation. Where and how it’s stored is an obvious attack vector.
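One partial mitigation is to strip obvious identifiers from any patient context before it leaves your environment (better still, keep the model in-house so it never does). The sketch below is deliberately simplistic; these few regular expressions are illustrative placeholders, not a compliant de-identification pipeline.

```python
# Illustrative-only sketch: crude redaction of obvious identifiers before any
# patient context is sent outside your environment. These patterns are
# placeholders; they are NOT a complete or compliant de-identification step.
import re

REDACTION_PATTERNS = {
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(context: str) -> str:
    """Replace obviously identifying tokens with labeled placeholders."""
    for label, pattern in REDACTION_PATTERNS.items():
        context = pattern.sub(f"[{label.upper()} REDACTED]", context)
    return context

print(redact("MRN: 0012345, seen 03/14/2023, callback 555-867-5309."))
# -> "[MRN REDACTED], seen [DATE REDACTED], callback [PHONE REDACTED]."
```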

Recommendation: Ensure that patient data and context, as well as the AI model itself, remain “inside your four walls.” Ensure that patient data is neither used in training, nor in model improvement, unless appropriately licensed and secured.

Testable

How can you verify trustworthiness and assess risk except through actual testing? The narrower the use case, the more readily testable the model is — and the more likely it is the developer will provide testing. Of course, given the problems of domain shift, any AI model needs to be tested in situ before being put into practice, which impacts ROI. Methodologies for validating models and quality criteria in AI are coming, but it’s unclear how (if ever) they’ll apply to LLMs. LLM plug-ins may be able to address the issue in specific problem segments, but only if there’s a way to validate that they’re being called appropriately.
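In the meantime, in-situ testing can be as simple as an acceptance test: score the model on a locally labeled sample and compare the result against both an absolute floor and the vendor’s reported performance, so domain shift shows up as a measured gap rather than a surprise. The metric, thresholds, and names in the sketch below are illustrative assumptions.

```python
# Minimal sketch of an in-situ acceptance test: score the vendor model on a
# locally labeled sample and fail loudly if local performance falls short of
# what was claimed. The 0.80 floor and 0.10 tolerance are illustrative.
from sklearn.metrics import roc_auc_score

def acceptance_test(y_local, local_scores,
                    vendor_reported_auc: float,
                    min_auc: float = 0.80,
                    max_gap: float = 0.10) -> dict:
    """Compare locally measured AUC with the vendor's reported figure."""
    local_auc = roc_auc_score(y_local, local_scores)
    gap = vendor_reported_auc - local_auc
    return {
        "local_auc": local_auc,
        "meets_floor": local_auc >= min_auc,   # absolute bar for this use case
        "within_claim": gap <= max_gap,        # a large gap flags likely domain shift
        "verdict": "deploy" if local_auc >= min_auc and gap <= max_gap else "investigate",
    }

# Usage sketch:
# result = acceptance_test(y_site, model.predict_proba(X_site)[:, 1],
#                          vendor_reported_auc=0.92)
```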

Recommendation: Adopt a framework for testing and validation. Prioritize understandable, reviewable AI, where individual steps can be separately and collectively verified via testing.

Conclusion

Popular AI technologies like ChatGPT are impressive, but they are also risky for clinical use because they’re not trustworthy (and it’s possible they may never be). That said, they offer the immediate promise of providing substantial productivity improvements in situations where the stakes are lower — for example, generating draft communications for patients like email messages or discharge notes, answering general medical questions quickly, and the like.

Risk must be weighed against reward when using AI in healthcare settings. The narrower the AI’s training, use case, and context, the more straightforward it is to assess and mitigate risk. Conversely, the broader the training and the more systemic the AI’s use, the more challenging risk assessment becomes, and the more methodical the mitigation must be.

Well-founded fears of poorly understood “black box” conclusions, and of AI-triggered errors harming patient health, rank among the current barriers to AI adoption in healthcare.

For medical use, AI that you can TRUST (that is, AI that is fully transparent, reviewable, understandable, secure, and testable) has a much lower risk profile, better addresses those barriers, and will be more easily adopted within health organizations.