Across disciplines, nearly two-thirds of efforts at organizational change, such as those premised on novel technologies, fail. In healthcare, incorporating research into the clinic takes over a quarter-century, and, even after incorporation, the resulting interventions often demonstrate low clinical response rates. Yet in digital health, more than 300,000 mobile applications and 340 consumer wearable devices exist, with 200 new mobile applications added daily (as of 2017). Research indicates that between 60% and 90% of start-ups, the primary source of these applications, fail over time. These patterns bode poorly for the ability of digital health solutions to ever reach the patient bedside.
Furthermore, even for technologies that do progress to clinical practice, there are reasons to doubt the outcomes they may produce for patients. A paucity of rigorous regulatory requirements, coupled with accelerated FDA marketing approvals (under 510(k) “substantial equivalence” pathways), has raised questions regarding the safety and efficacy of these innovations. In 2011, a warning was issued that a Rheumatology Calculator app may have underestimated risk scores by as much as 50%. In 2012, an insulin dosing app was recalled due to dangerously large inaccuracies. In 2013, profound diagnostic misses were documented in smartphone platforms for melanoma detection. Since then, myriad measurement errors and validation failures have been documented across indications and specialties, leading to speculation that a reproducibility crisis could be engulfing the field. Such a crisis would imply poor medical outcomes and, in turn, poor investment outcomes; in other words, the imminent bursting of bubbles at once scientific and financial.
It is in this context that we sought to investigate, and distinguish, two key underlying features of digital health technologies that leverage AI/ML: the algorithm and the training data. In the past, a stark dichotomy has been drawn between the two. Observations of racial and socioeconomic bias ingrained in these technologies have placed primary blame for such deficiencies on the algorithms.
However, this view overlooks the fact that the outputs of algorithms are a byproduct of the datasets on which they are trained and tested. In the article, we describe a number of potential output flaws (false positives and false negatives) that emanate from deficient datasets (with intrinsic sampling and observation bias). Particularly with respect to the potential for these technologies to demonstrate racial disparities, we comment that “[t]he abundance of data cannot presuppose its needed diversity, representative of the populations the algorithms seek to serve.” As we state, “data are not necessarily useful simply because they are voluminous.” To blame the algorithm without attention to the underlying data neglects the critical importance of expanding and troubleshooting datasets, rather than debugging algorithms alone, in improving the reproducibility, utility, and generalizability of these technologies.
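To make this point concrete, the following is a minimal, purely illustrative sketch (not drawn from the article; the subgroups, distributions, and sample sizes are entirely hypothetical). A deliberately simple threshold classifier is fit to a training sample that underrepresents one subgroup whose positive cases present with a weaker signal; the learning rule is identical for everyone, yet the undersampled group suffers a markedly higher false negative rate.

```python
# Illustrative only: how sampling bias in the training data, rather than
# the learning rule itself, can produce group-specific false negatives.
# All numbers below are hypothetical.
import random

random.seed(0)

def make_group(n_pos, n_neg, pos_mean):
    """Simulate one subgroup: positives carry a signal whose
    strength (pos_mean) differs by group; negatives center at 0."""
    data = [(random.gauss(pos_mean, 0.5), 1) for _ in range(n_pos)]
    data += [(random.gauss(0.0, 0.5), 0) for _ in range(n_neg)]
    return data

# Group A dominates the training sample (500+500 vs 10+10), and the
# undersampled group B's positives present with a weaker signal.
train = make_group(500, 500, 2.4) + make_group(10, 10, 1.2)

def best_threshold(data):
    """Pick the cutoff maximizing training accuracy. The 'algorithm'
    is a trivial, group-blind rule; any skew comes from the data."""
    candidates = sorted(x for x, _ in data)
    def acc(t):
        return sum((x >= t) == bool(y) for x, y in data) / len(data)
    return max(candidates, key=acc)

t = best_threshold(train)

def false_negative_rate(data, t):
    positives = [x for x, y in data if y == 1]
    return sum(x < t for x in positives) / len(positives)

# Evaluate on balanced held-out samples from each group.
test_a = make_group(1000, 1000, 2.4)
test_b = make_group(1000, 1000, 1.2)
print(f"threshold = {t:.2f}")
print(f"FN rate, well-sampled group A:  {false_negative_rate(test_a, t):.2f}")
print(f"FN rate, undersampled group B:  {false_negative_rate(test_b, t):.2f}")
```

Because the accuracy-maximizing threshold is dictated almost entirely by the overrepresented group, it sits near that group's decision boundary, and roughly half of the undersampled group's true positives fall below it. Expanding and rebalancing the dataset, not debugging the threshold rule, is what closes the gap.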
Finally, we discuss a potential new paradigm for the use of big data in healthcare: inductive reasoning. Compared with the conventional paradigm, which leverages big data for clinical decision support via deductive reasoning (and is susceptible to reductive conclusions based on a limited universe of available training data), inductive reasoning offers a means of “clinical decision questioning” that can be of particular value in an era of personalized healthcare.
We hope that the distinctions and suggestions made in our article can help digital medicine reach its potential to equitably and inclusively improve the lives of patients.