Objectives: This study assessed the accuracy and consistency of generative AI large language models (specifically OpenAI’s GPT-4 and GPT-3.5 and Google’s Bard) in answering nutrition quizzes designed by the US Department of Agriculture (USDA) for children’s education. These models may have an emerging role in supporting personalized learning for students.
Methods: Online USDA quizzes across 16 nutrition categories (151 total questions) were entered into each of the three AI models. Each quiz was run five times per model to evaluate consistency and accuracy. Model performance was evaluated statistically using t-tests, two-way ANOVA with replication, the Tukey-Kramer test, and Fleiss' kappa.
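As an illustration of the consistency measure named above, the following is a minimal sketch of Fleiss' kappa in plain Python. It assumes, hypothetically, that each of the five runs is treated as one "rater" and each quiz question as one item, with the selected answer choice as the category; the abstract does not state how the statistic was set up, and the function name `fleiss_kappa` and the toy answers are illustrative only, not study data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumption for this sketch: every item has the same number of raters
    (here, runs), and labels are hashable answer choices such as "A" or "B".
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})

    # counts[i][j] = number of raters who chose category j for item i
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance from the overall category proportions
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_cats)

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined
        return float("nan")
    return (p_bar - p_expected) / (1.0 - p_expected)


# Toy data (not from the study): 4 questions, 5 runs each
toy_answers = [
    ["A", "A", "A", "A", "A"],
    ["B", "B", "B", "B", "B"],
    ["C", "C", "C", "C", "D"],  # one run disagrees
    ["A", "A", "A", "A", "A"],
]
toy_kappa = fleiss_kappa(toy_answers)
```

Values near 1 indicate that the runs almost always gave the same answer to a given question, which is how concordance values such as those reported in the Results would be read.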
Results: Mean accuracies, averaged from the 5 runs across 16 quizzes, were 93% for GPT-4 (range 92.1-95.4%), 88% for GPT-3.5 (85.4-91.4%), and 89% for Bard (88.7-89.4%), with no significant differences (p > 0.05). There was also no significant difference in the consistency of answers across each set of 5 runs (Fleiss' kappa concordance values of 0.97, 0.97, and 0.98), though GPT-3.5 showed the most run-to-run variability in accuracy (6% spread), compared to GPT-4 (3% spread) and Bard (1% spread).
Among the 16 nutrition subject quizzes, Meal Components had the lowest mean scores, at 68%, 68%, and 82% for GPT-4, GPT-3.5, and Bard, respectively, whereas only School Nutrition Environment and Older Adult Nutrition achieved 100% in all models. The largest run-to-run variation in answers was in Child Nutrition Labels, where GPT-4 had a 30% spread (60% to 90%), GPT-3.5 a 20% spread, and Bard a 20% spread. Different models performed better on different subject quizzes (p < 0.05 using two-way ANOVA).
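The overall spread figures quoted above can be reproduced from the reported accuracy ranges. The short sketch below takes "spread" to mean the highest run accuracy minus the lowest, in percentage points, which is an assumption consistent with the reported numbers; the variable names are illustrative.

```python
# Reported per-model accuracy ranges across the 5 runs (lowest %, highest %)
reported_ranges = {
    "GPT-4": (92.1, 95.4),
    "GPT-3.5": (85.4, 91.4),
    "Bard": (88.7, 89.4),
}

# Spread = highest run accuracy minus lowest run accuracy (percentage points)
spreads = {
    model: round(high - low, 1)
    for model, (low, high) in reported_ranges.items()
}

# Rounded to whole percentage points, as quoted in the Results
rounded_spreads = {model: round(value) for model, value in spreads.items()}

for model in reported_ranges:
    print(f"{model}: {spreads[model]} points (about {rounded_spreads[model]}%)")
```

Under this definition the spreads come out to 3.3, 6.0, and 0.7 percentage points, which round to the 3%, 6%, and 1% figures given for GPT-4, GPT-3.5, and Bard.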
Conclusions: Given the accuracy and consistency shown in this study, generative AI language models have potential for providing reliable nutrition education, which is critical for personalized learning in children. However, the variation in performance across quiz topics and between models underscores the importance of monitoring content type if these models are integrated into education.
Funding Sources: This study did not receive any funding.