Investigating physician trust in AI systems: where is the line between effective collaboration and over-reliance?

AI advice systems need to work alongside doctors to be effective on the ground. Here, we study whether and how doctors trust the advice coming from these systems, and find a significant risk of over-reliance.


AI and ML advice systems hold a great deal of potential for clinical settings, from early detection of breast cancer to reducing racial disparities in pain treatment. But beyond achieving good performance in isolation, they must also be able to operate in complex socio-technical environments and work effectively with human decision-makers.  

Our paper focuses on how expert and non-expert physicians modulate their trust in AI-generated advice in order to walk a fine line: if physicians do not trust AI advice, they will not use it, but blind trust could lead to medical errors. To evaluate this dynamic, we presented radiologists and internal/emergency medicine (IM/EM) physicians with a series of chest X-rays and diagnostic advice. We varied two things: whether we presented the advice as coming from an AI system or a fellow radiologist, and whether the advice was correct or incorrect.

[Figure: Flowchart of the study procedure. Each physician reviews a chest X-ray alongside diagnostic advice, which varies in SOURCE (AI or human) and ACCURACY (accurate or inaccurate), then provides a quality evaluation followed by a diagnosis. The bottom half shows a specific example X-ray and its corresponding advice.]

We found that all physicians were susceptible to incorrect advice, regardless of its source. For high-risk settings like diagnostic decision making, such over-reliance on advice can be dangerous. When physicians ask for advice from colleagues, it is often after their initial review of the case, and entails a back-and-forth discussion. Automated systems may be less easily engaged in dialogue, which could prime physicians to search for, and accept, confirmatory information in place of conducting a thorough and critical evaluation.  

We also found that the over-reliance effect was more pronounced for physicians with less task expertise (IM/EM physicians). This suggests that task expertise is another important consideration in the deployment of clinical decision-aids; there is likely not a “one size fits all” approach to development and design, and understanding specific users’ behavior will be an integral step. 

Finally, we found significant variation in decision-making even within physicians of a particular expertise level. While some variation amongst physicians is natural, variations due to differences in physicians’ trust of support systems could be minimized through additional guidelines, regulations, or training.  

AI systems are important potential tools in healthcare, and we highlight research directions that are likely to be critical to their effective deployment: for example, designing effective ways to communicate advice uncertainty, or on-boarding tools that help physicians understand a system's limitations and calibrate their trust accordingly.

Overall, the fact that physicians were not able to effectively filter out inaccurate advice raises both concerns and opportunities for AI-based decision-support systems in clinical settings. While we cannot regulate the advice that physicians give one another, we can aim to design AI systems and interfaces that enable better collaboration.

Harini Suresh

PhD Student, MIT
