Publication

Comparative evaluation of artificial intelligence chatbots in answering electroencephalography-related questions

dc.contributor.authorProença, Soraia
dc.contributor.authorSoares, Joana Isabel
dc.contributor.authorParra, Joana
dc.contributor.authorMaia, Gisela
dc.contributor.authorLeite, Juliana
dc.contributor.authorBeniczky, Sándor
dc.contributor.authorJesus-Ribeiro, Joana
dc.contributor.authorHenrique Maia, Gisela Maria
dc.date.accessioned2025-12-18T12:31:22Z
dc.date.available2025-12-18T12:31:22Z
dc.date.issued2025-12-05
dc.description.abstractAs large language models (LLMs) become more accessible, they may be used to explain challenging EEG concepts to nonspecialists. This study aimed to compare the accuracy, completeness, and readability of EEG-related responses from three LLM-based chatbots and to assess inter-rater agreement. One hundred questions, covering 10 EEG categories, were entered into ChatGPT, Copilot, and Gemini. Six raters from the clinical neurophysiology field (two physicians, two teachers, and two technicians) evaluated the responses. Accuracy was rated on a 6-point scale, completeness on a 3-point scale, and readability was assessed using the Automated Readability Index (ARI). We used a repeated-measures ANOVA for group differences in accuracy and readability, the intraclass correlation coefficient (ICC) for inter-rater reliability, and a two-way ANOVA, with chatbot and raters as factors, for completeness. Total accuracy was significantly higher for ChatGPT (mean ± SD 4.54 ± .05) than for Copilot (mean ± SD 4.11 ± .08) and Gemini (mean ± SD 4.16 ± .13) (p < .001). ChatGPT's lowest performance was in normal variants and patterns of uncertain significance (mean ± SD 3.10 ± .14), while Copilot and Gemini performed lowest in ictal EEG patterns (mean ± SD 2.93 ± .11 and 3.37 ± .24, respectively). Although inter-rater agreement for accuracy was excellent among physicians (ICC = .969) and teachers (ICC = .926), it was poor for technicians in several EEG categories. ChatGPT achieved significantly higher completeness scores than Copilot (p < .001) and Gemini (p = .01). ChatGPT text (ARI mean ± SD 17.41 ± 2.38) was less readable than that of Copilot (ARI mean ± SD 11.14 ± 2.60) (p < .001) and Gemini (ARI mean ± SD 14.16 ± 3.33). The chatbots achieved relatively high accuracy, but not without flaws, emphasizing that the information they provide requires verification. ChatGPT outperformed the other chatbots in accuracy and completeness, though at the expense of readability. The lower inter-rater agreement among technicians may reflect a gap in standardized training or practical experience, potentially impacting the consistency of EEG-related content assessment.eng
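
Note on methodology: the Automated Readability Index cited in the abstract is a standard formula based on character, word, and sentence counts: ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43, with higher scores indicating text that requires a higher grade level to read. The following is a minimal Python sketch of that formula for context only; it is not the article's code, and the sample sentence is invented.

import re

def automated_readability_index(text: str) -> float:
    # Standard ARI formula: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43.
    # Higher scores correspond to harder-to-read text.
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

# Hypothetical example sentence, not taken from the study's question set.
print(round(automated_readability_index("EEG records the electrical activity of the brain."), 2))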
dc.identifier.citationProença, S., Soares, J. I., Parra, J., Maia, G., Leite, J., Beniczky, S., & Jesus-Ribeiro, J. (2025). Comparative evaluation of artificial intelligence chatbots in answering electroencephalography-related questions. Epileptic Disorders, 1–11. https://doi.org/10.1002/epd2.70156
dc.identifier.doi10.1002/epd2.70156
dc.identifier.eissn1950-6945
dc.identifier.issn1294-9361
dc.identifier.urihttp://hdl.handle.net/10400.22/31262
dc.language.isoeng
dc.peerreviewedyes
dc.publisherWiley
dc.relation.hasversionhttps://onlinelibrary.wiley.com/doi/10.1002/epd2.70156
dc.rights.urihttp://creativecommons.org/licenses/by/4.0/
dc.subjectArtificial intelligence
dc.subjectChatGPT
dc.subjectCopilot
dc.subjectElectroencephalography
dc.subjectGemini
dc.subjectLarge language model
dc.titleComparative evaluation of artificial intelligence chatbots in answering electroencephalography-related questionseng
dc.typejournal article
dspace.entity.typePublication
oaire.citation.endPage11
oaire.citation.startPage1
oaire.citation.titleEpileptic Disorders
oaire.versionhttp://purl.org/coar/version/c_970fb48d4fbd8a85
person.familyNameHenrique Maia
person.givenNameGisela Maria
person.identifier.ciencia-id2014-5C31-EBF4
person.identifier.orcid0000-0002-3199-340X
relation.isAuthorOfPublication0bdd4bff-1f99-4630-b3c2-d5759b95a2eb
relation.isAuthorOfPublication.latestForDiscovery0bdd4bff-1f99-4630-b3c2-d5759b95a2eb

Files

Original bundle
Name:
ART_Gisela Maia.pdf
Size:
15.72 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
4.03 KB
Format:
Description:
Item-specific license agreed upon to submission