See also Pedagogy.
Mahapatra, S. (2024). Impact of ChatGPT on ESL students’ academic writing skills: A mixed methods intervention study. Smart Learning Environments, 11. doi:10.1186/s40561-024-00295-9
An experimental study using ChatGPT to help ESL students learn writing. Students were trained to use it to get feedback on content, organization, and style. The ChatGPT group scored higher (on a writing rubric) in post-tests, and interviewed students said they found the feedback helpful. However, the writing tasks were 150-word paragraphs, and the paper is vague about whether the post-tests were conducted with or without ChatGPT access, so it’s unclear whether the students learned to write without ChatGPT’s aid or the graders simply preferred ChatGPT-mediated output, and whether the results generalize to longer forms of writing.
Oppenheimer, D. M., Cash, T. N., & Connell Pensky, A. E. (2025). You’ve got AI friend in me: LLMs as collaborative learning partners. doi:10.31219/osf.io/8q67u_v2
A similar experimental intervention in a large (154-student) intro course. Students used ChatGPT to help improve their argumentative essays, turned in their drafts and chat transcripts, and provided self-reflections on the revision process. They wrote five essays throughout the course, and performance on the final essay was evaluated before students received LLM feedback, to test whether they had improved. Found significant improvements in writing. But there was no control group, so it’s unclear how much of the improvement is attributable to ChatGPT rather than the progress students would have made without AI assistance.
Savelka, J., Agarwal, A., An, M., Bogart, C., & Sakr, M. (2023). Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research (Vol. 1, pp. 78–92). doi:10.1145/3568813.3600142
Evaluates GPT-4 (not 4o or any of the later, more coding-focused models) on undergrad Python course exercises. (These are online courses for certifications, not advanced CS courses.) Finds GPT-4 can likely earn a passing grade. I suspect this would be true of newer LLMs and our own undergraduate statistical computing course.
McDanel, B., & Novak, E. (2025). Designing LLM-resistant programming assignments: Insights and strategies for CS educators. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 756–762). doi:10.1145/3641554.3701872
Tests GPT-4o and Claude 3.5 Sonnet on SIGCSE’s Nifty Problems, which are basically papers presenting clever CS homework assignments and projects. Claude does particularly well, succeeding at most assignments, though a few assignments were consistently hard for the LLMs. Figure 4 summarizes what LLMs were bad at: anything involving visual output (despite being multimodal, LLMs are not very good at interpreting images), tasks requiring sequences of many steps, and, oddly, very detailed and specific assignment prompts: “Assignments that limit the solution space by giving detailed, clear, and explicit instructions are challenging for LLMs. Open-ended and greenfield projects are easier.”
Prather, J., Reeves, B. N., Leinonen, J., MacNeil, S., Randrianasolo, A. S., Becker, B. A., Kimmel, B., Wright, J., & Briggs, B. (2024). The widening gap: The benefits and harms of generative AI for novice programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research (Vol. 1, pp. 469–486). doi:10.1145/3632620.3671116
Fascinating think-aloud interview study with novice programmers using GitHub Copilot and ChatGPT to do a simple programming task. Half the subjects, who knew what they wanted and had a rough idea of strategy, did quite well. Half, however, struggled greatly: they were derailed by constant interruptions from Copilot offering suggestions, sidetracked by suggestions that actually solved a different problem, got stuck in misunderstandings of the task and couldn’t get un-stuck, and so on. Most ultimately finished the task, but slowly. They conclude: “From the evidence presented above, it appears that most of these ten who struggled thought they understood more than they actually did. The patterns of behavior above describe how participants were often led along by GenAI such that each step was able to be rationalized as understanding, making it even more difficult for participants to assess their own learning.” They caution that studies saying novice programmers self-report finding AI useful may be misleading, because their own interviews show students being derailed by it and then claiming it was helpful, perhaps because the students don’t have the metacognitive skills to recognize the result was wrong.
Thorgeirsson, S., Ewen, T., & Su, Z. (2025). What can computer science educators learn from the failures of top-down pedagogy? In Proceedings of the 56th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 1127–1133). doi:10.1145/3641554.3701873
Creates a dichotomy of teaching approaches: bottom-up approaches start with basic core skills and work up to integrate them, while top-down approaches start with the overall goal and work downward. Phonics, in which children learn to read by practicing sounding out individual words and then progressing to whole sentences, is bottom-up; whole-language reading, where students start reading entire sentences early and guess how to read unfamiliar words from context, is top-down. Whole-language reading was trendy until recently, but the empirical evidence suggests it doesn’t work. The authors worry that “approaches that rely heavily on large language models (LLMs) or ‘prompt programming’ run the risk of being the computer science equivalent of whole language, focusing heavily on end results without understanding the underlying mechanics”. They review several studies, including some cited above, suggesting that top-down LLM-based approaches may leave students with weak metacognitive skills. But, they note, the other part of education is engagement and motivation: whole-language reading is much more fun for teachers, who get to read stories instead of sounding out words, and LLM-based education might be more fun because of the fancy new technology. You can’t pick a pedagogy without considering whether it will motivate students and teachers.