
Performance of large language models in answering frequently asked questions on celiac disease

Valitutti, Francesco;
2026

Abstract

Objectives: Celiac disease (CeD) is a common autoimmune condition requiring lifelong adherence to a gluten-free diet (GFD). Patients and caregivers increasingly seek information online, and large language models (LLMs) have emerged as potential educational tools. However, their reliability in CeD remains uncertain. This study aimed to evaluate the performance of three popular LLMs in answering frequently asked questions (FAQs) about CeD and GFD management. Methods: We conducted a cross-sectional comparative evaluation in which 12 FAQs were submitted to three LLMs: ChatGPT-4 (OpenAI), Gemini Flash 2.5 (Google), and Claude Sonnet 3.7 (Anthropic). Six pediatric gastroenterologists with expertise in CeD research and education independently rated the responses for accuracy, completeness, clarity, and overall quality on a 5-point Likert scale. Results: The mean overall score across models was 4.3 ± 0.35 out of 5. Clarity received the highest ratings (4.56 ± 0.21), followed by accuracy (4.26 ± 0.52), overall quality (4.20 ± 0.36), and completeness (4.17 ± 0.21). Responses to management-related questions scored significantly higher than responses to diagnostic questions (4.4 vs. 4.2, p = 0.013). Inter-rater reliability was good (intraclass correlation coefficient = 0.74). Overall, Gemini achieved the highest ratings (p < 0.01). Conclusions: LLMs provide clear and generally accurate responses to CeD FAQs, particularly on management-related topics. While they represent a promising tool for patient education, variability in accuracy highlights the need for clinician oversight when interpreting artificial intelligence-generated medical information.
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11391/1616239
Citations
  • PMC: 1
  • Scopus: ND
  • Web of Science: ND