I have been testing various #AI #LLM systems with #Moodle #AIText to see how well they evaluate the correctness of programming code. Students would submit small amounts of code and the LLM would point out errors such as missing braces or parentheses.
The results are wildly inconsistent across multiple systems. A model will give an accurate evaluation of one submission and then, on the next, overlook an error that a casual human observer would spot.
Tested against #OpenAI, #ChatGPT, Google Gemini, Anthropic & others.
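
For reference, the class of error mentioned above (a missing brace or parenthesis) can be caught deterministically, which gives a fixed baseline to compare LLM feedback against. This is a minimal sketch in Python and is not part of the Moodle AIText setup. The function name `find_unbalanced` is hypothetical, and the check deliberately ignores strings and comments, so brackets inside them would be counted too.

```python
# Hypothetical helper: a simple stack-based bracket check.
# Assumption: brackets inside string literals or comments are NOT skipped.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def find_unbalanced(source):
    """Return (line, column, char) tuples for brackets that have no partner."""
    stack = []      # openers still waiting for a closer
    problems = []   # closers that did not match the most recent opener
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in OPENERS:
                stack.append((line_no, col, ch))
            elif ch in PAIRS:
                if stack and stack[-1][2] == PAIRS[ch]:
                    stack.pop()
                else:
                    problems.append((line_no, col, ch))
    # Anything left on the stack was opened but never closed.
    problems.extend(stack)
    return problems


if __name__ == "__main__":
    submission = "int main() {\n  return 0;\n"
    for line_no, col, ch in find_unbalanced(submission):
        print(f"unmatched '{ch}' at line {line_no}, column {col}")
```

Running the same submissions through a check like this and through each LLM would show exactly which mistakes a model skipped.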