Researchers created a chatbot to help teach a university law class – but the AI kept messing up

This post was originally published on this site.



Author: Armin Alimardani, Senior Lecturer in Law and Emerging Technologies, University of Wollongong

Original article: https://theconversation.com/researchers-created-a-chatbot-to-help-teach-a-university-law-class-but-the-ai-kept-messing-up-257551


Mikhail Nilov/Pexels, CC BY

“AI tutors” have been hyped as a way to revolutionise education.

The idea is that generative artificial intelligence tools (such as ChatGPT) could adapt to any teaching style set by a teacher. The AI could guide students step by step through problems and offer hints without giving away answers. It could then deliver precise, immediate feedback tailored to each student's individual learning gaps.

Despite the enthusiasm, there is limited research testing how well AI performs in teaching environments, especially within structured university courses.

In our new study, we developed our own AI tool for a university law class. We wanted to know: can it genuinely support personalised learning, or are we expecting too much?

Our study

In 2022, we developed SmartTest, a customisable educational chatbot, as part of a broader project to democratise access to AI tools in education.

Unlike generic chatbots, SmartTest is purpose-built for educators, allowing them to embed questions, model answers and prompts. This means the chatbot can ask relevant questions, deliver accurate and consistent feedback and minimise hallucinations (or mistakes). SmartTest is also instructed to use the Socratic method, encouraging students to think, rather than spoon-feeding them answers.
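To make this concrete, below is a minimal sketch of how an educator-configured, Socratic chatbot can be wired up with OpenAI's chat completions API. It is illustrative only, not the actual SmartTest code: the prompt wording, the sample question and the model answer are all assumptions.

```python
# Illustrative sketch only -- not the actual SmartTest implementation.
# Assumes the OpenAI Python client (v1+) and an educator-supplied
# question and model answer (both hypothetical examples here).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Is the accused guilty of theft in this scenario?"
model_answer = "No: the accused lacked the intention to permanently deprive."

# The educator's configuration is folded into the system prompt, which
# instructs the model to tutor Socratically rather than reveal answers.
system_prompt = (
    "You are a tutor in a criminal law course. "
    f"The current question is: {question} "
    f"The model answer, for your reference only, is: {model_answer} "
    "Use the Socratic method: ask guiding questions, point out gaps in the "
    "student's reasoning, and never state the model answer directly."
)

def tutor_reply(history: list[dict], student_message: str) -> str:
    """Return the chatbot's next turn, given the conversation so far."""
    messages = [
        {"role": "system", "content": system_prompt},
        *history,
        {"role": "user", "content": student_message},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content
```

The key design choice is that the embedded model answer guides the feedback but is withheld from the student.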

We trialled SmartTest over five test cycles in a criminal law course (which one of us was coordinating) at the University of Wollongong in 2023.

Each cycle introduced varying degrees of complexity. The first three cycles used short hypothetical criminal law scenarios (for example, is the accused guilty of theft in this scenario?). The last two cycles used simple short-answer questions (for example, what’s the maximum sentencing discount for a guilty plea?).

An average of 35 students interacted with SmartTest in each cycle across several criminal law tutorials. Participation was voluntary and anonymous, with students interacting with SmartTest on their own devices for up to ten minutes per session. Students’ conversations with SmartTest – their attempts at answering the question, and the immediate feedback they received from the chatbot – were recorded in our database.

After the final test cycle, we surveyed students about their experience.



What we found

SmartTest showed promise in guiding students and helping them identify gaps in their understanding.

However, in the first three cycles (the problem-scenario questions), between 40% and 54% of conversations had at least one example of inaccurate, misleading, or incorrect feedback.

When we shifted to the much simpler short-answer format in cycles four and five, the error rate dropped significantly to between 6% and 27%. However, even in these best-performing cycles, some errors persisted. For example, sometimes SmartTest would affirm an incorrect answer before providing the correct one, which risks confusing students.

A significant revelation was the sheer effort required to get the chatbot working effectively in our tests. Far from a time-saving silver bullet, integrating SmartTest involved painstaking prompt engineering and rigorous manual assessments from educators (in this case, us). This paradox – where a tool promoted as labour-saving demands significant labour – calls into question its practical benefits for already time-poor educators.

Inconsistency is a core issue

SmartTest’s behaviour was also unpredictable. Under identical conditions, it sometimes offered excellent feedback and at other times provided incorrect, confusing or misleading information.

For an educational tool tasked with supporting student learning, this raises serious concerns about reliability and trustworthiness.

To assess if newer models improved performance, we replaced the underlying generative AI powering SmartTest (ChatGPT-4) with newer models, such as ChatGPT-4.5, which was released in 2025.

We tested these models by replicating instances where SmartTest had provided poor feedback to students in our study. The newer models did not consistently outperform older ones. Sometimes, their responses were even less accurate or useful from a teaching perspective. As such, newer, more advanced AI models do not automatically translate to better educational outcomes.
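As a rough illustration of how such a comparison can be run, the sketch below replays a recorded conversation against more than one model so the responses can be reviewed side by side. The data structure and model names here are assumptions for illustration, not the study's actual evaluation harness.

```python
# Illustrative replay harness -- assumed data shapes and model names,
# not the study's actual evaluation code.
from openai import OpenAI

client = OpenAI()

# A recorded failure case: the conversation up to the point where the
# original model gave poor feedback (hypothetical example).
failure_case = {
    "case_id": "cycle-1-theft-07",
    "messages": [
        {"role": "system", "content": "You are a Socratic criminal law tutor."},
        {"role": "user", "content": "The accused is guilty of theft because they touched the item."},
    ],
}

def replay(case: dict, models: tuple[str, ...] = ("gpt-4", "gpt-4.5-preview")) -> list[dict]:
    """Re-run one recorded conversation against each model for manual review."""
    results = []
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=case["messages"],
        )
        results.append({
            "case_id": case["case_id"],
            "model": model,
            "reply": response.choices[0].message.content,
        })
    return results
```

Judging whether each reply was accurate or pedagogically useful was still done manually by the educators.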

What does this mean for students and teachers?

The implications for students and university staff are mixed.

Generative AI may support low-stakes, formative learning activities. But in our study, it could not provide the reliability, nuance and subject-matter depth needed for many educational contexts.

On the plus side, our survey results indicated students appreciated the immediate feedback and conversational tone of SmartTest. Some mentioned it reduced anxiety and made them more comfortable expressing uncertainty. However, this benefit came with a catch: incorrect or misleading answers could just as easily reinforce misunderstandings as clarify them.

Most students (76%) preferred having access to SmartTest rather than no opportunity to practise questions. However, when given the choice between receiving immediate feedback from AI or waiting one or more days for feedback from human tutors, only 27% preferred AI. Nearly half preferred human feedback with a delay and the rest were indifferent.

This suggests a critical challenge: students enjoy the convenience of AI tools, but they still place higher trust in human educators.

A need for caution

Our findings suggest generative AI should still be treated as an experimental educational aid.

The potential is real – but so are the limitations. Relying too heavily on AI without rigorous evaluation risks compromising the very educational outcomes we are aiming to enhance.

The Conversation

Armin Alimardani previously had a short-term, part-time contract with OpenAI as a consultant. The organisation had no input into the study featured in this article. The views expressed in this article are those of the authors.

This work was supported by the Early-Mid-Career Researcher Enabling Grants Scheme, University of Wollongong (2022, Project ID: R5829).

This work was supported by the School Research Grant, School of the Arts and Media (SAM), UNSW Sydney (2023, Project ID: PS68922); the Research Infrastructure Scheme, Faculty of Arts, Design, and Architecture, UNSW Sydney (2023, Project ID: PS68745); and the School Research Grant, SAM, UNSW Sydney (2022, Project ID: PS66264).