
Performance of large language models in answering frequently asked questions on celiac disease

Valitutti, Francesco;
2026

Abstract

Objectives: Celiac disease (CeD) is a common autoimmune condition requiring lifelong adherence to a gluten-free diet (GFD). Patients and caregivers increasingly seek information online, and large language models (LLMs) have emerged as potential educational tools. However, their reliability in CeD remains uncertain. This study aimed to evaluate the performance of three popular LLMs in answering frequently asked questions (FAQs) about CeD and GFD management. Methods: We conducted a cross-sectional comparative evaluation in which 12 FAQs were submitted to three LLMs: ChatGPT-4 (OpenAI), Gemini Flash 2.5 (Google), and Claude Sonnet 3.7 (Anthropic). Six pediatric gastroenterologists with expertise in CeD research and education independently rated the responses for accuracy, completeness, clarity, and overall quality on a 5-point Likert scale. Results: The mean overall score across models was 4.3 ± 0.35 out of 5. Clarity received the highest ratings (4.56 ± 0.21), followed by accuracy (4.26 ± 0.52), overall quality (4.20 ± 0.36), and completeness (4.17 ± 0.21). Responses to management-related questions scored significantly higher than responses to diagnostic questions (4.4 vs. 4.2, p = 0.013). Inter-rater reliability was good (intraclass correlation coefficient = 0.74). Overall, Gemini achieved the highest ratings (p < 0.01). Conclusions: LLMs provide clear and generally accurate responses to CeD FAQs, particularly on management-related topics. While they represent a promising tool for patient education, variability in accuracy highlights the need for clinician oversight when interpreting artificial intelligence-generated medical information.
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11391/1616239
Citations
  • PMC: 1
  • Scopus: ND
  • Web of Science: ND