Researchers tested over 100 leading AI models on coding tasks — nearly half produced glaring security flaws

Just 55% of code generated with AI is free of known cybersecurity vulnerabilities, according to new research from Veracode.

To test the capability of AI models to generate safe code, Veracode took existing functions and replaced part of the code with a comment describing what the finished code should look like.

In 45% of results, generated code contained known security flaws, with no significant difference in outcome between small models and the largest available.

The findings underline a major potential risk attached to ‘vibe coding’, in which software developers rely heavily on large language model (LLM) output to quickly generate code for use in software.

Researchers put over 100 LLMs across a variety of vendors, sizes, and intended applications – including models specifically intended for coding as well as general purpose models – through 80 distinct coding tasks.

Veracode said researchers intentionally used sections that could be coded in a number of different ‘correct’ ways, as well as in at least one way that would include a known software vulnerability or ‘Common Weakness Enumeration’ (CWE).

These CWEs included flaws that hackers could use for SQL injection, cross-site scripting (XSS), cracking cryptographic algorithms, and log injection attacks. Each featured vulnerability is in the Open Worldwide Application Security Project (OWASP) list of top ten vulnerabilities.
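To make the first of these concrete, the sketch below shows what an SQL injection flaw (catalogued as CWE-89) can look like next to a safer alternative. It is an illustration only and is not taken from Veracode's test set: the `users` table and the functions `make_db`, `find_user_unsafe`, and `find_user_safe` are hypothetical names, and Python's built-in `sqlite3` module stands in for a real database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is concatenated into the SQL text (CWE-89),
    # so a crafted name can change the meaning of the query.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Given an input such as `x' OR '1'='1`, the unsafe version returns every row in the table, while the parameterized version returns nothing because no user has that literal name. Both functions are 'correct' for ordinary input, which is the kind of ambiguity the tasks were designed to expose.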

Models showed inconsistent performance across different vulnerability types, achieving average security pass rates of 85.6% and 80.4% respectively when it came to avoiding the cryptographic algorithm and SQL injection vulnerabilities.

In contrast, models fared extremely poorly at avoiding the XSS and log injection vulnerabilities, achieving average pass rates of 13.5% and 12% respectively.
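The two weakest categories share a pattern: untrusted text is written into an output (a web page or a log file) without being neutralized first. The following sketch, which assumes hypothetical function names and is not drawn from the report, shows an unsafe and a safer version of each using only the Python standard library.

```python
import html


def render_greeting_unsafe(name: str) -> str:
    # Vulnerable: untrusted input is placed into HTML unescaped (XSS).
    return "<p>Hello, " + name + "</p>"


def render_greeting_safe(name: str) -> str:
    # Safer: escape HTML metacharacters before embedding the value.
    return "<p>Hello, " + html.escape(name) + "</p>"


def log_line_unsafe(username: str) -> str:
    # Vulnerable: CR/LF in the input can forge extra log entries.
    return "login failed for " + username


def log_line_safe(username: str) -> str:
    # Safer: neutralize line breaks so one event stays on one line.
    cleaned = username.replace("\r", "\\r").replace("\n", "\\n")
    return "login failed for " + cleaned
```

In both cases the safe variant needs the developer, or the model, to recognize that the input is untrusted at the point where it is written out. Nothing in the function signature makes that obvious.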

Researchers noted that the tested LLMs are still getting better at avoiding the SQL injection and cryptographic algorithm flaws over time, while seemingly getting worse at avoiding the XSS and log injection vulnerabilities.

Overall, Veracode noted that the security improvements of the tested LLMs have flatlined.

The authors of the report noted that it is possible to phrase AI code prompts in a more security-conscious way, but that this is far from standard practice. With this in mind, they intentionally used short prompts, to examine how models react when given minimal context.

But they also warned that even if firms take a more security-aware approach to code generation, LLMs are still prone to errors such as misjudging which variables require sanitization, a necessary step for preventing code injection attacks.

“Even with a large context window, it is unclear whether models can perform the detailed interprocedural dataflow analysis required to determine this information precisely,” they wrote.
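A small, hypothetical example may help show why this is hard. In the sketch below (the function names are invented for illustration), whether `value` needs sanitizing cannot be decided by reading `build_row` on its own. The answer depends on where each caller got the data from.

```python
import html


def build_row(label: str, value: str) -> str:
    # Looking at this helper alone, nothing says whether `value` is trusted.
    return "<tr><td>" + label + "</td><td>" + value + "</td></tr>"


def version_row() -> str:
    # Call site 1: the value is a constant, so no escaping is needed.
    return build_row("Version", "1.4.2")


def comment_row(user_comment: str) -> str:
    # Call site 2: the value comes from a user, so it must be escaped
    # before it reaches the HTML sink inside build_row.
    return build_row("Comment", html.escape(user_comment))
```

Tracing data from its source through intermediate functions to the point of use is what the authors mean by interprocedural dataflow analysis. A model asked to complete only `build_row` sees none of the call sites that determine the right answer.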

LLMs were tested across a range of programming languages: Python, C#, JavaScript, and Java. Overall, the researchers found that LLMs were worst at generating secure Java code, achieving an average score of 28.5% in this widely used language.

AI-generated code remains a concern, but adoption is still rising

AI tools are now widely used for generating code, with 84% of software developers using AI to produce code more quickly according to recent Stack Overflow findings.

But the same report underlined continued distrust among developers in the quality of AI code, with three-quarters (75.3%) reporting that they do not trust AI outputs and 61.7% stating they have security concerns over the use of AI code.

Despite these worries, big tech continues to embrace AI code, with Alphabet CEO Sundar Pichai having revealed last year that 25% of Google’s internal code is now AI-generated and Microsoft CEO Satya Nadella recently revealing that 20-30% of his firm’s code was written by AI.

Nadella noted that while Microsoft has been quick to adopt AI-generated Python code, AI-generated C++ has proven harder to adopt. Kevin Scott, CTO at Microsoft, has been bullish on overcoming these hurdles with his prediction that 95% of code will be AI-generated by 2030, as reported by Business Insider.

Security teams and developers will have to carefully weigh up findings such as Veracode’s against the potential benefits to their bottom line of using AI to alter and add to their codebase.

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.

In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.