The world's 'first AI software engineer' isn't living up to expectations: Cognition AI's 'Devin' assistant was touted as a game changer for developers, but so far it's fumbling tasks and struggling to compete with human workers

Devin, a coding assistant hailed as the world’s 'first AI software engineer', was given 20 coding tasks and managed to complete just three, taking longer than expected and going down strange routes to achieve its goals.

The AI coding tool, developed by Cognition AI, was hailed as a transformative solution to help streamline software development when it was unveiled last year.

Costing around $500 per month, the AI assistant works via Slack, so using it feels like chatting to a colleague. At launch, Cognition showed a demo of Devin picking up jobs on Upwork, a freelancing platform that software engineers use to find work.

However, third-party researchers have been unable to replicate those results, according to reports: one software developer picked apart the Upwork claims, and AI researchers who assessed Devin found it lacking.

Devin was framed as a game-changing AI tool

At Devin's launch last year, Cognition claimed that the tool could "make money taking on messy Upwork tasks," sharing a video purporting to show just that.

But software developer Carl Brown posted his own video in response, arguing that the company was not telling the truth about the tool's abilities, revealing what "Devin was supposed to do, what it actually managed to do instead, and how bad of a job that it did."

Brown noted that it took 36 minutes to do the task himself, and six hours for Devin to fail to do it.

Cognition's claims about Devin were also tested by a team of researchers at Answer.AI, whose results were closer to Brown's than to the claims in Cognition's original blog post: Devin completed only three of 20 tasks.

There were some "early wins", however. Devin could pull a Notion database into Google Sheets with "surprising competence", they noted, completing the task in an hour with only a few minutes of human interaction.

The code worked, but was "a bit verbose." Another task, building a planet tracker, was similarly successful.

"This felt like a glimpse into the future - an AI that could handle the 'glue code' tasks that consume so much developer time," the researchers said.

More complicated tasks started to raise challenges, or as the researchers said: "as we scaled up our testing, cracks appeared."

"Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions," they noted. "Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible."

Over a month, they tasked Devin with creating new projects from scratch, performing research and analyzing or modifying existing projects, but out of 20 such tasks, just three were successful.

"The most frustrating aspect wasn’t the failures themselves - all tools have limitations - but rather how much time we spent trying to salvage these attempts," they said.

How to use Devin

That's a far cry from what was advertised when the AI assistant was first unveiled in March of last year. A blog post on Cognition's website claimed Devin could take on basic tasks for software engineers, allowing them to focus on bigger problems.

The website says Devin can find and fix bugs, build and deploy an entire app end-to-end, and even train and fine-tune an AI model.

"With our advances in long-term reasoning and planning, Devin can plan and execute complex engineering tasks requiring thousands of decisions," the company said. "Devin can recall relevant context at every step, learn over time, and fix mistakes."

Cognition hasn't yet replied to a request for comment from ITPro, but its own blog post does give some context to how the system could be used more successfully than these tests suggest.

The company says Devin "can be an all-purpose tool", but recommends starting with smaller tasks such as simple bugs. Notably, the company said that it works best when you "give Devin tasks that you know how to do yourself" and tell the tool how to test or check its own work.

Thereafter, Devin can prove beneficial in helping to break down large tasks into smaller ones that will take less than three hours.

Given Answer.AI's success using Devin for smaller "glue code" tasks, perhaps such advice about starting small should be heeded.

Indeed, this research challenging the usefulness of the current crop of AI software assistants comes as Meta founder Mark Zuckerberg has predicted that AI will be doing the work of mid-level engineers this year — but with some serious caveats.

"In the beginning it’ll be really expensive to run, then you can get it to be more efficient and then over time we’ll get to the point where a lot of the code in our apps and including the AI that we generate is actually going to be built by AI engineers instead of people engineers," he said.