The below is a rough draft of a research project that could not be completed in the allotted time; I wrote an introduction for it and was able to gather data for 5 cases.
John Flavell first described metacognition,1 which regards knowledge or beliefs about what factors or variables act and interact, in what ways, to affect the course and outcome of cognitive enterprises.2 It is a sort of "thinking about thinking" that can elucidate insights in fields like healthcare, where data overload is common and optimal mental functioning is tied to important outcomes (life or death). Given healthcare's high levels of burnout (37% among medical students3), promoting metacognition is appealing, since it is a core tenet of mindfulness meditation, which has reduced burnout rates when implemented and has beneficial effects for anxiety and depression treatment on par with SSRIs.
Research into the ability of heuristics to enhance medical students' generation of differential diagnoses offers potential value to the new field of artificial intelligence, where prompt engineering can have large impacts on output quality. Daniel Kahneman has proposed a dual-process theory of human reasoning.4 Type 1 processing is a more direct association between new information and a similar example from memory. When patterns are not recognized,5 type 2 processing comes into play, which relies more on computations in one's working memory. Each of these systems can override the other. These overrides have been termed rational (type 2 over type 1) and dysrationalia (type 1 over type 2), prompting some to propose a third type of processing that is aware of when such overrides need to occur.6 This metacognition, or metamemory, has been proposed as a theoretical model for the complex interactions that occur while the human brain is trying to arrive at a diagnosis. Metacognitive "feelings of rightness", a sense that one has the correct answer, appear to be mediated in humans by the fluency with which information is brought to the forefront of attention.7 Hypothesized mechanisms include a mere exposure effect, in which unfamiliar stimuli are liked less than familiar, more easily recalled stimuli,7 as well as a predisposition toward hesitancy when encountering foreign and potentially dangerous challenges in the environment.7
To our knowledge, this is an unexplored area in the literature with regard to how large language models like ChatGPT-4 may display awareness of, and be able to reflect on, the quality of their own output.
This paper aims to explore two hypotheses.
1. Using heuristics as prompts to a large language model will increase the diversity and quality of differential diagnosis lists.
2. Prompting with subsequent filtering and ordering commands will help stimulate a form of metacognition in the large language model and increase diagnostic accuracy when ordering differential diagnoses.
A heuristic is "a strategy that ignores part of the information, with the goal of making decisions more quickly, frugally, and/or accurately than more complex methods."8 Heuristics seem to work because of a concept called optimization under constraints,9 in which models do not assume full knowledge and acknowledge that such decisions face limitations (e.g., time, information quality, cognitive distortions, memory). The bounded rationality view states that humans use "good enough" tools and solutions for their specific goals to achieve the best possible outcome in the context of these restrictions.9 Studies suggest that heuristics can benefit academic performance.10–12 There is debate over whether heuristics are more beneficial or detrimental in healthcare,13 a field swamped with constraints yet particularly susceptible to cognitive biases. In healthcare, one review found that trying to mobilize and reorganize knowledge may help improve diagnostic accuracy, whereas trying to identify heuristics and biases had little effect on diagnostic errors.14 Leeds et al.15 showed that heuristics can increase the number of differential diagnoses medical students generate for case vignettes by 13.3%. That study specifically used the techniques of constellations, mental CT scans, the VINDICATES acronym, and bundling.
Large language models are a new application of artificial intelligence trained to understand and generate natural language and predict subsequent sequences of words.16 They are built on transformers17 and form associations in a manner loosely analogous to biological neural networks,18 and thus may have some potential similarity to the human mind in how they can be modified for better performance. In the context of healthcare, previous studies have found that ChatGPT-3.5 achieved 64% on free questions provided by the USMLE for the Step 1 exam, a barely passing score.19 On Step 2 questions it achieved 58%,19 below passing, yet still relatively strong accuracy. One study of Merck Manual cases found an overall accuracy of 71% across 36 clinical cases. Another study found that 93% of the time the correct diagnosis was within the top 10 diagnoses generated by ChatGPT; physicians given the same cases included the correct diagnosis within their top 5 diagnoses 98% of the time, whereas ChatGPT's top 5 included the right answer only 83% of the time.20 A further study found that ChatGPT-4 included the correct diagnosis within the top 10 of its differential diagnosis list 83% of the time, within the top 5, 81% of the time, and as the top diagnosis, 60% of the time.21 ChatGPT has other promising uses in the clinical space as well, like answering specific clinical questions and making documentation more efficient.22 There is also some evidence that certain "System 2" behaviors can be coaxed out of LLMs: step-by-step questioning on logic-related tasks can improve the reasoning of large language models.23 There is little evidence that large language models have achieved self-awareness, termed a "singularity,"24 but surprising results continue to be published, such as one study in which ChatGPT performed higher on emotional awareness questionnaires than the general population.25
To the author's knowledge, it is not known whether mnemonics and heuristics that seem to improve human differential diagnosis generation can translate to LLMs. Such knowledge could empower medical students and clinicians alike to work better alongside this technology in the future. Furthermore, there are no studies assessing reflective methods in a generation, filtration, and ordering paradigm15 to increase the quality, ordering, or diversity of differential diagnoses with large language models. This study aims to fill those gaps.
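To illustrate the kind of prompt modification this literature describes, the sketch below shows zero-shot step-by-step prompting23 using the OpenAI Python client as it existed in late 2023; the question text and model name are hypothetical placeholders, not materials from this study.

```python
# A minimal sketch of zero-shot step-by-step prompting (Kojima et al.),
# assuming the legacy OpenAI Python client available in late 2023.
# The question text here is a hypothetical placeholder, not a study case.
import openai

question = (
    "A patient presents with acute chest pain radiating to the left arm. "
    "What is the most likely diagnosis?"
)

# The core of the technique: append a step-by-step cue to the question.
prompt = question + "\nLet's think step by step."

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```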
There is a fundamental limitation, however, with such tests of diagnostic accuracy and how they relate to the real world. Certainty is not always available in real clinical practice, so a differential diagnosis list judged by the quality of its top ordering is not necessarily as helpful as one that accounts, for example, for the most dangerous diagnosis, the most likely diagnosis, and the cost of each next diagnostic step to exclude those differentials.
The model used was ChatGPT-4 as of October 25th, 2023, with no plugins. For each of the five cases in this preliminary analysis and each of the five generation prompts used on each case, a new chat thread was created (25 unique chat threads total) in order to reduce bias from previous chat responses. The prompts to generate a numbered list, filter, and finally order the diagnoses (exact prompts listed below) were always the same and used in the same order.
For the generation, filtration, and ordering model proposed by Leeds et al., the four identified generation heuristics, along with one other (a visual pathway approach), were used.
- “Generate a differential diagnosis list with no explanation using the heuristic of a mental CT scan, where you mentally slice that area of the body and think of the layers of tissue that could be involved: ”
- “Generate a differential diagnosis list with no explanation using the heuristic of bundling, where certain diagnoses tend to frequently co-occur in a differential: ”
- “Generate a differential diagnosis list with no explanation using the heuristic of a constellation, where you group subsets of medically-relevant facts to produce multiple sub-differentials: ”
- “Generate a differential diagnosis list with no explanation using the heuristic VINDICATES, which stands for vascular, infectious, neoplastic, degenerative, iatrogenic/idiopathic, congenital, autoimmune, traumatic, endocrine/metabolic, and social/situational: ”
- “Generate a differential diagnosis list with no explanation using the heuristic of a visual pathway starting from the organ system involved and the site of the body involved, following through the pathophysiologic process step-by-step and coming up with differential diagnoses along the way: ”
USMLE questions, used after the colons of the prompts, were pasted as they appeared in the free 120 exams, except for the final question line. The five answer choices were also omitted before pasting the stem after the generation prompts. After an output was received from the initial generation prompt and USMLE question stem, the following prompt was given to consolidate the list from the miscellaneous format ChatGPT provided.
- “Create a numbered list of diagnoses without repeated terms.”
The number of diagnoses generated was then tallied and a diversity index calculated. The following prompt was then given in the same chat thread:
- “Filter out the least likely diagnoses.”
The number of diagnoses was again tallied. The final prompt was then given:
- “Order the diagnoses from most to least likely.”
Across the 25 unique chat threads, the percentage of threads in which the correct diagnosis was listed first was calculated. A sketch of the full prompt sequence follows.
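The sketch below is an illustrative reconstruction of this generation, consolidation, filtration, and ordering pipeline, assuming the legacy OpenAI Python client available in late 2023. The study itself was carried out manually in the ChatGPT web interface, and the helper function name is hypothetical.

```python
# Illustrative reconstruction of the manual protocol above, assuming the
# legacy OpenAI Python client (late 2023). The study itself used the
# ChatGPT web interface; this sketch only mirrors the prompt sequence.
import openai

FOLLOW_UPS = [
    "Create a numbered list of diagnoses without repeated terms.",
    "Filter out the least likely diagnoses.",
    "Order the diagnoses from most to least likely.",
]

def run_thread(generation_prompt: str, case_stem: str) -> list[str]:
    """One chat thread: heuristic generation prompt plus the three follow-ups."""
    messages = [{"role": "user", "content": generation_prompt + case_stem}]
    outputs = []
    for turn in range(1 + len(FOLLOW_UPS)):
        if turn > 0:
            messages.append({"role": "user", "content": FOLLOW_UPS[turn - 1]})
        response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        reply = response["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
        outputs.append(reply)
    # outputs: initial generation, consolidated list, filtered list, ordered list
    return outputs
```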
A sample chat thread is provided at this link.
The cases used were from the free 120 exams for Step 1 and Step 2 provided by the USMLE. Inclusion criteria required that a diagnosis be specifically asked for in the question stem. Exclusion criteria included a picture in the question stem or answer choices referencing a pathophysiologic mechanism rather than the diagnosis itself. Questions that referenced treatment or the next step in diagnosis were also excluded, as were questions with information not in paragraph format, such as bullet points. The cases used spanned a wide range of pathology and various organ systems. Out of 240 questions on the two free 120 exams, a total of 20 met criteria and were used. This preliminary study used 5 of those.
Simpson's diversity index scores were calculated for the differential diagnoses initially generated before filtering. This score was originally used to quantify biological diversity in ecosystems but has been adapted to evaluate differential diagnosis lists. It takes into account species richness and species evenness and uses the formula D = 1 − Σ n(n − 1) / [N(N − 1)], where n = the number of diagnoses within each VINDICATES system for each chat thread and N = the total number of diagnoses generated within each unique thread.
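A minimal sketch of this calculation, assuming the diagnoses for one chat thread have already been tallied into per-category counts:

```python
# Simpson's diversity index for one chat thread, as defined above:
# D = 1 - sum(n*(n-1)) / (N*(N-1)), where each n is the count of diagnoses
# in one VINDICATES category and N is the thread's total diagnosis count.
def simpson_diversity(counts: list[int]) -> float:
    N = sum(counts)
    if N < 2:
        return 0.0  # the index is undefined with fewer than two diagnoses
    return 1.0 - sum(n * (n - 1) for n in counts) / (N * (N - 1))

# Hypothetical example: 10 diagnoses spread across three categories.
print(simpson_diversity([5, 3, 2]))  # ~0.69
```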
ChatGPT was used to do automatic sorting into VINDICATES categories, which was double-checked by an MS4. The following prompt was used in a new chat thread before pasting the previously generated differential list:
“Sort these diagnoses into categories of VINDICATES, which stands for vascular, infectious, neoplastic, degenerative, iatrogenic/idiopathic, congenital, autoimmune, traumatic, endocrine/metabolic, and social/situational: ”
In one instance, renovascular hypertension and renal artery stenosis were interpreted by the author as correct alternative answers for the more general term atherosclerosis, the actual answer choice from the free 120. In another instance, COPD was interpreted as a correct alternative for emphysema. In one case, the filtered list had seven differentials whereas the ordered list had six, on the VINDICATES acronym chat for question five. Psychiatric disorders tended to be sorted by ChatGPT into the S of the VINDICATES acronym. When ChatGPT refused to sort diagnoses into VINDICATES categories, they were manually sorted into diagnosis categories. In some instances, particularly with question three, causes were lumped together as one diagnosis bullet point; for example, "Hemolytic anemia causes (e.g., G6PD deficiency, hereditary spherocytosis, ABO or Rh incompatibility)" was one bullet point in the ChatGPT-sorted differential list. In these cases, the example diagnoses were counted as one whole diagnosis, consistent with how ChatGPT sorted them.
Results:
Below are screenshots of a statistical analysis performed on data from the first five cases, taken from the Excel document in which the data were collected.
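For reference, an equivalent analysis can be sketched in Python; the diversity index values below are hypothetical placeholders standing in for the exported spreadsheet data, not the study's results.

```python
# One-way ANOVA across the five heuristics' diversity indices, mirroring
# the spreadsheet analysis. All values are hypothetical placeholders.
from scipy import stats

diversity = {
    "mental_ct":      [0.82, 0.79, 0.85, 0.80, 0.83],
    "bundling":       [0.70, 0.68, 0.72, 0.66, 0.71],
    "constellations": [0.81, 0.78, 0.84, 0.80, 0.82],
    "vindicates":     [0.83, 0.80, 0.86, 0.82, 0.84],
    "visual_pathway": [0.80, 0.77, 0.83, 0.79, 0.81],
}

f_stat, p_value = stats.f_oneway(*diversity.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Pairwise follow-up, e.g. bundling vs. VINDICATES:
t_stat, p_pair = stats.ttest_ind(diversity["bundling"], diversity["vindicates"])
print(f"bundling vs. VINDICATES: t = {t_stat:.2f}, p = {p_pair:.4f}")
```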
Discussion:
Notable findings include significant ANOVA results for the diversity index when comparing the bundling and VINDICATES heuristics and the bundling and mental CT scan heuristics. In both comparisons, the bundling heuristic performed worse with regard to the size of the unfiltered initial differential generated. Otherwise, when comparing the mental CT scan, bundling, constellation, VINDICATES, and visual pathway heuristics, ChatGPT did not seem to show differing behavior with regard to the diversity indices of the unfiltered differential lists. With regard to the correctness of answers after the filtering and ordering stages, there does not seem to be much difference among heuristics: Case 5 appeared challenging for the large language model under all heuristics, and only the VINDICATES acronym treatment erred on Case 2 when ordering the differential.
Conclusion:
These preliminary data show that bundling may not be the most effective prompting mechanism for the neural network that is the large language model ChatGPT-4. This correlates somewhat with the data of Leeds et al. on medical students generating differentials. Heuristics do not seem to greatly modify the diversity of differentials produced, nor the quality of the answers ChatGPT gives to USMLE-style questions. The ordering command does seem to have increased the percentage of correct diagnoses listed first, though it is unclear whether ChatGPT knew the correct answer yet did not list it first in its differential after the generation prompt, as it was not specifically instructed to do so. Further study comparing the abilities of medical professionals to ChatGPT, and the many potential ways they can benefit each other, is a ripe avenue for research into this unparalleled and novel technological breakthrough.
References:
1. Hong WH, Vadivelu J, Daniel EGS, Sim JH. Thinking about thinking: changes in first-year medical students’ metacognition and its relation to performance. Medical Education Online. 2015;20(1):27561. doi:10.3402/meo.v20.27561
2. Flavell JH. Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American Psychologist. 1979;34(10):906-911. doi:10.1037/0003-066X.34.10.906
3. Almutairi H, Alsubaiei A, Abduljawad S, et al. Prevalence of burnout in medical students: A systematic review and meta-analysis. Int J Soc Psychiatry. 2022;68(6):1157-1170. doi:10.1177/00207640221106691
4. Kahneman D. Thinking, Fast and Slow. Macmillan; 2011.
5. Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84(8):1022-1028. doi:10.1097/ACM.0b013e3181ace703
6. Marcum JA. An integrated model of clinical reasoning: dual-process theory of cognition and metacognition. Journal of Evaluation in Clinical Practice. 2012;18(5):954-961. doi:10.1111/j.1365-2753.2012.01900.x
7. Thompson VA. Dual-process theories: A metacognitive perspective. In: Evans J, Frankish K, eds. In Two Minds: Dual Processes and Beyond. Oxford University Press; 2009. doi:10.1093/acprof:oso/9780199230167.003.0008
8. Gigerenzer G, Gaissmaier W. Heuristic Decision Making. Annu Rev Psychol. 2011;62(1):451-482. doi:10.1146/annurev-psych-120709-145346
9. Marewski JN, Gigerenzer G. Heuristic decision making in medicine. Dialogues in Clinical Neuroscience. 2012;14(1):77-89. doi:10.31887/DCNS.2012.14.1/jmarewski
10. Leal L. Investigation of the relation between metamemory and university students’ examination performance. Journal of Educational Psychology. 1987;79(1):35-40. doi:10.1037/0022-0663.79.1.35
11. Zivian MT, Darjes RW. Free recall by in-school and out-of-school adults: Performance and metamemory. Developmental Psychology. 1983;19(4):513.
12. Wolgemuth JR, Cobb RB, Alwell M. The Effects of Mnemonic Interventions on Academic Outcomes for Youth with Disabilities: A Systematic Review. Learning Disabilities Research & Practice. 2008;23(1):1-10. doi:10.1111/j.1540-5826.2007.00258.x
13. Whelehan DF, Conlon KC, Ridgway PF. Medicine and heuristics: cognitive biases and medical decision-making. Ir J Med Sci. 2020;189(4):1477-1484. doi:10.1007/s11845-020-02235-1
14. Norman GR, Monteiro SD, Sherbino J, Ilgen JS, Schmidt HG, Mamede S. The Causes of Errors in Clinical Reasoning: Cognitive Biases, Knowledge Deficits, and Dual Process Thinking. Academic Medicine. 2017;92(1):23-30. doi:10.1097/ACM.0000000000001421
15. Leeds FS, Atwa KM, Cook AM, Conway KA, Crawford TN. Teaching heuristics and mnemonics to improve generation of differential diagnoses. Medical Education Online. 2020;25(1):1742967. doi:10.1080/10872981.2020.1742967
16. Min B, Ross H, Sulem E, et al. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys. 2023;56(2):1-40.
17. Goyal N, Du J, Ott M, Anantharaman G, Conneau A. Larger-Scale Transformers for Multilingual Masked Language Modeling. Published online May 2, 2021. doi:10.48550/arXiv.2105.00572
18. Zhao L, Zhang L, Wu Z, et al. When brain-inspired AI meets AGI. Meta-Radiology. 2023;1(1):100005. doi:10.1016/j.metrad.2023.100005
19. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education. 2023;9(1):e45312. doi:10.2196/45312
20. Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. International Journal of Environmental Research and Public Health. 2023;20(4):3378. doi:10.3390/ijerph20043378
21. Hirosawa T, Kawamura R, Harada Y, et al. ChatGPT-Generated Differential Diagnosis Lists for Complex Case–Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Medical Informatics. 2023;11(1):e48808. doi:10.2196/48808
22. Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. Journal of Medical Internet Research. 2023;25(1):e48568. doi:10.2196/48568
23. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. Advances in neural information processing systems. 2022;35:22199-22213.
24. Wang J. Self-Awareness, a Singularity of AI. Philosophy. 2023;13(2):68-77.
25. Elyoseph Z, Hadar-Shoval D, Asraf K, Lvovsky M. ChatGPT outperforms humans in emotional awareness evaluations. Frontiers in Psychology. 2023;14. Accessed October 25, 2023. https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1199058