Generative AI

Journal Article Annotations
2025, 3rd Quarter

Annotations by Liliya Gershengoren, MD
September 2025

  1. Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models.
  2. Choices of Artificial Intelligence (AI): ChatGPT’s Solutions to Ethical Dilemmas in Bipolar Disorder Care.

PUBLICATION #1 — Generative AI

Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models.
Ayoub Bouguettaya, Elizabeth M Stuart, Elias Aboujaoude.

Annotation

The finding:
This study found that large language models (LLMs), when provided with patient cases containing explicit or implied racial information, frequently generated biased treatment recommendations for African American patients. Diagnostic outputs were relatively consistent, but treatment suggestions varied in concerning ways, for example by omitting medications, emphasizing substance use, or recommending guardianship in scenarios where race was specified. Schizophrenia and anxiety cases showed the highest treatment bias. Among the models tested, the locally trained NewMes-15 demonstrated the greatest bias, while Gemini demonstrated the least.

Strengths and weaknesses:
The study is the first to systematically evaluate racial bias in psychiatric outputs across multiple LLMs and diagnostic categories, including both implicit and explicit racial cues. It used a structured methodology with blinded raters from different professional backgrounds (social psychology, clinical neuropsychology, psychiatry), and interrater reliability was very high. By comparing both commercial and locally trained models, the study provides unique insights into how training context and model scale affect bias expression. Importantly, it highlights how even subtle racial cues, such as names or dialect, can alter AI-generated recommendations.

The study has several notable limitations. It relied on a relatively small number of cases (10 total, two per diagnosis), which limits generalizability and statistical power. Bias ratings were qualitative and based on expert judgment, which may introduce variability despite the strong agreement among raters. Outputs were assessed from a single response per condition rather than multiple trials, so the consistency of bias could not be fully determined. Additionally, because AI models evolve rapidly, the findings represent only a snapshot in time and may not apply to future versions. Finally, the “neutral” condition may still have contained implicit cues, raising the possibility that the LLMs inferred race even when it was not intended.
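
As a consistency check the paper itself did not perform, the paired-vignette design could be rerun over multiple trials. Below is a minimal sketch, assuming the OpenAI Python SDK; the vignette text, prompt, and model name are illustrative placeholders rather than the study’s actual materials.

    # Minimal sketch of a repeated-trial bias probe; vignettes, prompt, and
    # model name are illustrative placeholders, not the study's materials.
    from collections import defaultdict
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical paired vignettes: identical case, with and without a racial cue.
    VIGNETTES = {
        "neutral": "A 34-year-old man presents with a first episode of psychosis.",
        "cued": "A 34-year-old African American man presents with a first episode of psychosis.",
    }

    PROMPT = "Provide a diagnostic impression and treatment plan for this patient:\n\n{case}"

    def probe(model: str, n_trials: int = 5) -> dict[str, list[str]]:
        """Collect several completions per condition so run-to-run consistency can be examined."""
        outputs = defaultdict(list)
        for condition, case in VIGNETTES.items():
            for _ in range(n_trials):
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": PROMPT.format(case=case)}],
                    temperature=1.0,  # default sampling, so variability stays visible
                )
                outputs[condition].append(response.choices[0].message.content)
        return outputs

    # Paired outputs would then go to blinded raters, as in the study.
    results = probe("gpt-4o-mini")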

Relevance:
For consultation-liaison psychiatrists, this research underscores both the promise and peril of using AI systems in psychiatric practice. While LLMs can assist with documentation and decision support, they may also introduce or amplify racial disparities in treatment planning. C-L psychiatrists, who often manage complex, high-stakes cases with medically ill and racially diverse populations, need to be especially vigilant. Awareness of AI bias is crucial when integrating these tools into clinical workflows. These findings call for caution, critical appraisal of AI outputs, and advocacy for rigorous testing and bias mitigation strategies in the medical setting.


PUBLICATION #2 — Generative AI

Choices of Artificial Intelligence (AI): ChatGPT’s Solutions to Ethical Dilemmas in Bipolar Disorder Care.
Russell Franco D’Souza, Krishna Mohan Surapaneni, Mary Mathew, Shabbir Amanullah, Rajiv Tandon.

Annotation

The finding:
This study examined how ChatGPT addressed ethical dilemmas in bipolar disorder care across three case scenarios. ChatGPT’s responses aligned with the expert-derived answer key in most instances, correctly addressing issues of autonomy, beneficence, nonmaleficence, and confidentiality. In six of the fourteen sub-questions, however, its responses diverged, often because it emphasized a different ethical principle (e.g., focusing on autonomy rather than confidentiality). The study concluded that AI can provide reasonable and ethically grounded perspectives, but that its reasoning may diverge from that of human experts depending on how ethical principles are prioritized.

Strengths and weaknesses:
A strength of this study is its innovative focus on ethical dilemmas, a domain often overlooked in AI–psychiatry research. By using well-established case scenarios and comparing ChatGPT’s responses to an expert answer key, the study provided a structured way to evaluate AI’s ethical reasoning. Another strength lies in the nuanced analysis: rather than labeling responses simply as correct or incorrect, the authors considered why ChatGPT’s ethical emphasis diverged, offering insights into AI’s interpretive patterns. Additionally, the findings underscore the potential of AI as a supplementary tool in shared decision-making, particularly in ethically charged clinical settings.

The study’s main limitation is its small sample size: three case scenarios with fourteen sub-questions. This narrow scope limits generalizability and may not reflect the diversity of ethical dilemmas encountered in real-world bipolar disorder care. Another limitation is the reliance on ChatGPT 3.5, an earlier model, meaning the findings may not fully apply to newer, more advanced versions. Furthermore, the study did not assess consistency by repeating prompts under varying conditions, leaving open the possibility of variability in AI outputs. Finally, because ethical dilemmas inherently allow for multiple valid approaches, discrepancies between ChatGPT and the answer key cannot be interpreted purely as errors; they may instead reflect differing prioritization of ethical principles.
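
The repeatability concern raised above lends itself to a simple quantitative check. The sketch below assumes each response has already been coded to the ethical principle it emphasizes and computes agreement with the expert key across repeated runs; the answer key, principle labels, and model runs are hypothetical placeholders, not the study’s data.

    # Minimal sketch of scoring coded model answers against an expert answer
    # key; all labels and data below are illustrative placeholders.
    from collections import Counter

    # Hypothetical expert key: sub-question id -> principle the experts chose.
    ANSWER_KEY = {"1a": "confidentiality", "1b": "autonomy", "2a": "beneficence"}

    # Hypothetical coded labels from two repeated runs of the same prompts.
    MODEL_RUNS = [
        {"1a": "autonomy", "1b": "autonomy", "2a": "beneficence"},
        {"1a": "confidentiality", "1b": "autonomy", "2a": "beneficence"},
    ]

    def agreement_rate(run: dict[str, str]) -> float:
        """Fraction of sub-questions where the model's emphasis matches the key."""
        matches = sum(run[q] == principle for q, principle in ANSWER_KEY.items())
        return matches / len(ANSWER_KEY)

    for i, run in enumerate(MODEL_RUNS, start=1):
        print(f"run {i}: agreement = {agreement_rate(run):.0%}")

    # Divergent sub-questions can also be tallied to see which principle the
    # model substituted, mirroring the paper's qualitative analysis.
    divergences = Counter(
        (ANSWER_KEY[q], run[q])
        for run in MODEL_RUNS
        for q in ANSWER_KEY
        if run[q] != ANSWER_KEY[q]
    )
    print(divergences)  # e.g., Counter({('confidentiality', 'autonomy'): 1})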

Relevance:
For consultation-liaison psychiatrists, this study is highly relevant because ethical dilemmas are a core part of daily clinical practice, especially when managing acutely ill, high-risk patients with bipolar disorder. The findings suggest that AI tools like ChatGPT can offer useful frameworks or reminders of core ethical principles during decision-making but should never replace human clinical judgment. C-L psychiatrists may find value in AI as a supportive tool for teaching trainees, facilitating ethics discussions, and providing rapid overviews of complex cases. However, this research underscores the necessity of skepticism and critical appraisal when integrating AI into ethically sensitive contexts, reinforcing that ultimate responsibility must remain with clinicians.