Behind the paper: Machine Learning for Patient Risk Stratification: Standing on, or looking over, the shoulders of clinicians?

Our recent publication in npj Digital Medicine, Machine Learning for Patient Risk Stratification: Standing on, or looking over, the shoulders of clinicians?, examines whether clinical machine learning models truly extend beyond what clinicians already suspect.

While diagnosis and prognosis are important areas where algorithms can provide value, algorithms must be designed in a manner that acknowledges the influence that physician behavior has on patient physiology. Data accessible through the electronic medical record (EMR) are increasingly available and are consequently a popular target for training machine learning algorithms. However, specific data elements in the EMR, such as diagnoses, notes, and prescriptions, are expressions of expertise by the physician who generated them. In contrast, data modalities such as imaging or telemetry are direct representations of patient physiology. We describe these as "clinician-initiated data" and "non-clinician-initiated data," respectively. We hypothesized that this difference is critical in the development and interpretation of any model intended to guide clinical practice. While models constructed from physiological data may offer genuine insights to physicians, models constructed from physician behavioral data may provide predictions learned by "looking over the shoulder" of the physician-user. As an example, an algorithm may learn that troponin tests are a strong predictive signal for a future diagnosis of myocardial infarction (MI). Because these tests are likely ordered by physicians who already suspect MI, models that utilize this signal can be highly accurate in predicting patient outcomes without being useful to physicians. The tendency for models to reinterpret existing physician suspicions may

To understand the role that clinician-initiated data can play in a predictive model, we examined a popular set of hospital outcome prediction tasks: in-hospital mortality, readmission, and extended length of stay. For this study, we used a dataset of charge details representing nearly 43 million hospitalizations across 973 hospitals nationwide. These charge details form a unique dataset consisting only of the actions undertaken and resources utilized by a physician during patient care; in other words, a dataset consisting only of clinician-initiated data. Even after the inclusion of demographic and provider details, these data averaged only 120 features per patient encounter. We compared the performance of models trained on this dataset to full-EMR literature benchmarks utilizing an average of 217,000 features per patient. Despite using orders of magnitude fewer features and far less computational complexity, our models achieved performance close to the benchmarks: in-hospital mortality (0.89 AUC), prolonged length of stay (0.82 AUC), and 30-day readmission (0.71 AUC). These results suggest that clinician-initiated data, manifesting through physician behaviors, are an extremely potent source of signal for models when available.
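To make the idea of a charge-based feature set concrete, here is a minimal sketch of how charge line items could be turned into per-encounter count features. It is an illustration only: the charge codes, the encounter identifiers, and the helper `build_encounter_features` are hypothetical, and this is not the feature pipeline used in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical charge line items: (encounter_id, charge_code).
# The codes and encounters are invented for illustration only.
charge_lines = [
    ("enc1", "LAB_TROPONIN"),
    ("enc1", "LAB_TROPONIN"),
    ("enc1", "ECG_12_LEAD"),
    ("enc2", "XRAY_CHEST"),
    ("enc2", "LAB_CBC"),
]


def build_encounter_features(lines):
    """Count how often each charge code appears within each encounter."""
    features = defaultdict(Counter)
    for encounter_id, charge_code in lines:
        features[encounter_id][charge_code] += 1
    # Convert to plain dicts so the result is easy to inspect or serialize.
    return {enc: dict(counts) for enc, counts in features.items()}


encounter_features = build_encounter_features(charge_lines)

# Average number of distinct features per encounter in this toy data.
mean_features = sum(len(f) for f in encounter_features.values()) / len(encounter_features)
```

Every feature in such a representation records something a clinician chose to order or do, which is why a model trained on it can only reflect physician behavior.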

If models derive significant signal from physician behavior, model performance is likely to be heavily influenced by whether the physician has already reached a diagnosis. Post-diagnostic behaviors are likely to be motivated by the diagnosis itself, providing an additional mechanism for physician suspicion to leak into training data. To examine this, we compared the performance of models trained on hospital-wide populations to models trained on patients who received a diagnosis of MI during their admission. We observed a significant decline in performance when the hospital-wide model was applied to the MI cohort, suggesting that the model is unable to "guess" physician suspicion without a prior diagnosis.
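The following toy sketch illustrates, under invented numbers, why a score driven by physician behavior can discriminate well across a whole hospital population yet lose discrimination inside a diagnosed cohort. The patient tuples and the `auc` helper are assumptions made for illustration; they are not the paper's data, models, or evaluation code.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive case
    is scored above a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical patients: (behavior_driven_score, in_mi_cohort, adverse_outcome).
# Scores are high wherever physicians already acted on a suspicion of MI.
patients = [
    (0.1, False, 0),
    (0.2, False, 0),
    (0.1, False, 0),
    (0.3, False, 1),
    (0.8, True, 1),
    (0.9, True, 0),
    (0.8, True, 0),
    (0.9, True, 1),
]

# Discrimination over the whole (toy) hospital population.
auc_all = auc([s for s, _, _ in patients], [y for _, _, y in patients])

# Discrimination restricted to patients who already carry the diagnosis.
mi_only = [(s, y) for s, in_mi, y in patients if in_mi]
auc_mi = auc([s for s, _ in mi_only], [y for _, y in mi_only])
```

In this toy data the score separates outcomes hospital-wide mostly by separating diagnosed from undiagnosed patients, so within the MI cohort, where that behavioral signal is shared by everyone, the AUC falls to chance (0.5).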

Our results have two important implications. First, they validate the information content of lower-resolution administrative datasets for observational, rather than predictive, purposes. While prediction is a popular topic, there are many other areas where large-scale analyses can provide value. Analyses aimed at improving logistics, cohort selection, or guideline creation are all made easier using clinician-initiated data. Second, for applications that focus on providing individualized patient outcome predictions, our results suggest that models that utilize clinician-initiated data should treat physician behavior as a baseline of performance.

By understanding which situations allow models to extrapolate from physician behaviors, researchers can better target domains where algorithms are most likely to provide genuine, novel guidance. Clinician-initiated data have a valuable role in benchmarking existing predictive algorithms and in providing a basis for wider observational study.

Brett Beaulieu-Jones

Instructor of Biomedical Informatics, Harvard Medical School