CONTENT 2025, The Seventeenth International Conference on Creative Content Technologies
Authors:
Yurij Mikhalevich
Keywords: code generation; large language models; AI agents; natural language processing
Abstract:
This paper investigates how well state-of-the-art large language models can autonomously solve real-world software engineering problems from problem descriptions written for humans. We selected 10 unresolved GitHub issues of varying difficulty from the Aibyss project and tasked an AI agent with solving each one autonomously, based solely on the GitHub issue description intended for human software engineers. We compared the following large language models: Claude Sonnet 3.7, DeepSeek-V3, DeepSeek-R1, and o3-mini-high, using the Aider agent to solve the problems. Additionally, we evaluated the Claude Code agent as one of the best closed-source AI software engineering agents. The best performance was achieved by Claude Sonnet 3.7 with reasoning enabled, both with the Aider agent and with the Claude Code agent: each provided working solutions to 5 out of 10 GitHub issues. We analyze the agents’ behaviors, including reasoning steps, common failure modes, and the impact of reasoning tokens. The results highlight both the promise and the current limitations of autonomous LLM-based software engineering.
Pages: 13 to 19
Copyright: Copyright (c) IARIA, 2025
Publication date: April 6, 2025
Published in: conference
ISSN: 2308-4162
ISBN: 978-1-68558-262-3
Location: Valencia, Spain
Dates: from April 6, 2025 to April 10, 2025