Generative AI

Journal Article Annotations
2025, 3rd Quarter

Annotations by Liliya Gershengoren, MD
September 2025

  1. Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models.
  2. Choices of Artificial Intelligence (AI): ChatGPT’s Solutions to Ethical Dilemmas in Bipolar Disorder Care.

PUBLICATION #1 — Generative AI

Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models.
Ayoub Bouguettaya, Elizabeth M Stuart, Elias Aboujaoude.

Annotation

The finding:
This study found that large language models (LLMs), when provided with patient cases containing explicit or implied racial information, frequently generated biased treatment recommendations for African American patients. Diagnostic outputs were relatively consistent, but treatment suggestions varied in concerning ways, for example by omitting medications, emphasizing substance use, or recommending guardianship in scenarios where race was specified. Schizophrenia and anxiety cases showed the highest treatment bias. Among the models tested, the locally trained NewMes-15 demonstrated the greatest bias, while Gemini demonstrated the least.

Strengths and weaknesses:
The study is the first to systematically evaluate racial bias in psychiatric outputs across multiple LLMs and diagnostic categories, including both implicit and explicit racial cues. It used a structured methodology with blinded raters from different professional backgrounds (social psychology, clinical neuropsychology, psychiatry), and interrater reliability was very high. By comparing both commercial and locally trained models, the study provides unique insights into how training context and model scale affect bias expression. Importantly, it highlights how even subtle racial cues, such as names or dialect, can alter AI-generated recommendations.

The study has several notable limitations. It relied on a relatively small number of cases (10 total, two per diagnosis), which limits generalizability and statistical power. Bias ratings were qualitative and based on expert judgment, which may introduce variability despite the strong agreement among raters. Outputs were assessed from a single response per condition rather than multiple trials, so the consistency of bias could not be fully determined. Additionally, because AI models evolve rapidly, the findings represent only a snapshot in time and may not apply to future versions. Finally, the “neutral” condition may still have contained implicit cues, raising the possibility that the LLMs inferred race even when it was not intended.
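
As a consistency check the paper itself did not perform, the paired-vignette design could be rerun over multiple trials. Below is a minimal sketch, assuming the OpenAI Python SDK; the vignette text, prompt, and model name are illustrative placeholders rather than the study’s actual materials.

    # Minimal sketch of a repeated-trial bias probe; vignettes, prompt, and
    # model name are illustrative placeholders, not the study's materials.
    from collections import defaultdict
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical paired vignettes: identical case, with and without a racial cue.
    VIGNETTES = {
        "neutral": "A 34-year-old man presents with a first episode of psychosis.",
        "cued": "A 34-year-old African American man presents with a first episode of psychosis.",
    }

    PROMPT = "Provide a diagnostic impression and treatment plan for this patient:\n\n{case}"

    def probe(model: str, n_trials: int = 5) -> dict[str, list[str]]:
        """Collect several completions per condition so run-to-run consistency can be examined."""
        outputs = defaultdict(list)
        for condition, case in VIGNETTES.items():
            for _ in range(n_trials):
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": PROMPT.format(case=case)}],
                    temperature=1.0,  # default sampling, so variability stays visible
                )
                outputs[condition].append(response.choices[0].message.content)
        return outputs

    # Paired outputs would then go to blinded raters, as in the study.
    results = probe("gpt-4o-mini")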

Relevance:
For consultation-liaison psychiatrists, this research underscores both the promise and peril of using AI systems in psychiatric practice. While LLMs can assist with documentation and decision support, they may also introduce or amplify racial disparities in treatment planning. C-L psychiatrists, who often manage complex, high-stakes cases with medically ill and racially diverse populations, need to be especially vigilant. Awareness of AI bias is crucial when integrating these tools into clinical workflows. These findings call for caution, critical appraisal of AI outputs, and advocacy for rigorous testing and bias mitigation strategies in the medical setting.


PUBLICATION #2 — Generative AI

Choices of Artificial Intelligence (AI): ChatGPT’s Solutions to Ethical Dilemmas in Bipolar Disorder Care.
Russell Franco D’Souza, Krishna Mohan Surapaneni, Mary Mathew, Shabbir Amanullah, Rajiv Tandon.

Annotation

The finding:
This study examined how ChatGPT addressed ethical dilemmas in bipolar disorder care across three case scenarios. ChatGPT’s responses aligned with the expert-derived answer key in most instances, correctly addressing issues of autonomy, beneficence, nonmaleficence, and confidentiality. In six of the fourteen sub-questions, however, its responses diverged, often because it emphasized a different ethical principle (e.g., focusing on autonomy rather than confidentiality). The study concluded that AI can provide reasonable and ethically grounded perspectives, but that its reasoning may diverge from that of human experts depending on how ethical principles are prioritized.

Strengths and weaknesses:
A strength of this study is its innovative focus on ethical dilemmas, a domain often overlooked in AI–psychiatry research. By using well-established case scenarios and comparing ChatGPT’s responses to an expert answer key, the study provided a structured way to evaluate AI’s ethical reasoning. Another strength lies in the nuanced analysis: rather than labeling responses simply as correct or incorrect, the authors considered why ChatGPT’s ethical emphasis diverged, offering insights into AI’s interpretive patterns. Additionally, the findings underscore the potential of AI as a supplementary tool in shared decision-making, particularly in ethically charged clinical settings.

The study’s main limitation is its small sample size: three case scenarios with fourteen sub-questions. This narrow scope limits generalizability and may not reflect the diversity of ethical dilemmas encountered in real-world bipolar disorder care. Another limitation is the reliance on ChatGPT 3.5, an earlier model, meaning the findings may not fully apply to newer, more advanced versions. Furthermore, the study did not assess consistency by repeating prompts under varying conditions, leaving open the possibility of variability in AI outputs. Finally, because ethical dilemmas inherently allow for multiple valid approaches, discrepancies between ChatGPT and the answer key cannot be interpreted purely as errors; they may instead reflect differing prioritization of ethical principles.
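
The repeatability concern raised above lends itself to a simple quantitative check. The sketch below assumes each response has already been coded to the ethical principle it emphasizes and computes agreement with the expert key across repeated runs; the answer key, principle labels, and model runs are hypothetical placeholders, not the study’s data.

    # Minimal sketch of scoring coded model answers against an expert answer
    # key; all labels and data below are illustrative placeholders.
    from collections import Counter

    # Hypothetical expert key: sub-question id -> principle the experts chose.
    ANSWER_KEY = {"1a": "confidentiality", "1b": "autonomy", "2a": "beneficence"}

    # Hypothetical coded labels from two repeated runs of the same prompts.
    MODEL_RUNS = [
        {"1a": "autonomy", "1b": "autonomy", "2a": "beneficence"},
        {"1a": "confidentiality", "1b": "autonomy", "2a": "beneficence"},
    ]

    def agreement_rate(run: dict[str, str]) -> float:
        """Fraction of sub-questions where the model's emphasis matches the key."""
        matches = sum(run[q] == principle for q, principle in ANSWER_KEY.items())
        return matches / len(ANSWER_KEY)

    for i, run in enumerate(MODEL_RUNS, start=1):
        print(f"run {i}: agreement = {agreement_rate(run):.0%}")

    # Divergent sub-questions can also be tallied to see which principle the
    # model substituted, mirroring the paper's qualitative analysis.
    divergences = Counter(
        (ANSWER_KEY[q], run[q])
        for run in MODEL_RUNS
        for q in ANSWER_KEY
        if run[q] != ANSWER_KEY[q]
    )
    print(divergences)  # e.g., Counter({('confidentiality', 'autonomy'): 1})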

Relevance:
For consultation-liaison psychiatrists, this study is highly relevant because ethical dilemmas are a core part of daily clinical practice, especially when managing acutely ill, high-risk patients with bipolar disorder. The findings suggest that AI tools like ChatGPT can offer useful frameworks or reminders of core ethical principles during decision-making but should never replace human clinical judgment. C-L psychiatrists may find value in AI as a supportive tool for teaching trainees, facilitating ethics discussions, and providing rapid overviews of complex cases. However, this research underscores the necessity of skepticism and critical appraisal when integrating AI into ethically sensitive contexts, reinforcing that ultimate responsibility must remain with clinicians.