12,000 API keys and passwords were found in a popular AI training dataset – experts say the issue is down to poor identity management
Increasingly complex identity management is driving complacency among devs and leading to hardcoded secrets exposing API keys


The discovery of almost 12,000 valid secrets in the archive of a popular AI training dataset is the result of the industry’s inability to keep up with the complexities of identity management, experts have told ITPro.
Researchers at Truffle Security found nearly 12,000 ‘live’ API keys and passwords when analysing the Common Crawl archive used to train open source LLMs such as DeepSeek.
The researchers trawled through the December 2024 Common Crawl archive, consisting of 400TB of web data gathered from 2.67 billion web pages, and found 11,908 live secrets using their open source secret scanner, TruffleHog.
The report found these secrets had been hardcoded in the front-end HTML and JavaScript, rather than using server-side environment variables.
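To illustrate the difference the report describes, the following is a minimal Python sketch of the server-side approach. It is not taken from the Truffle Security report; the variable name `MAILING_LIST_API_KEY`, the `/subscribe` route, and both function names are hypothetical and chosen only for this example.

```python
import os

# Hypothetical environment variable name, used only for this illustration.
API_KEY_ENV_VAR = "MAILING_LIST_API_KEY"


def load_api_key(env=os.environ):
    """Read the key from the server's environment and fail loudly if it is missing."""
    key = env.get(API_KEY_ENV_VAR)
    if not key:
        raise RuntimeError(f"{API_KEY_ENV_VAR} is not set on the server")
    return key


def render_signup_page():
    """Build the HTML sent to the browser; the key is never interpolated into it."""
    # The form posts to a hypothetical server route, which would call
    # load_api_key() and contact the third-party API on the user's behalf.
    return '<form action="/subscribe" method="post"><input name="email"></form>'
```

Because the key only ever exists in the server process, a crawler that archives the rendered page has nothing to collect.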
In total, TruffleHog found 219 different secret types in the archive including API keys for AWS and Walkscore.
Mailchimp API keys were the most frequently leaked secret, however, with the researchers finding 1,500 unique keys hardcoded into HTML forms and JavaScript snippets.
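As a rough illustration of how hardcoded keys can be spotted in page source, the sketch below searches HTML or JavaScript text for strings shaped like a Mailchimp API key. It is not how TruffleHog works internally. The key shape (32 hexadecimal characters, a hyphen, then a data-centre suffix such as `us6`) is an assumption made for this example, and the function name is hypothetical.

```python
import re

# Assumed key shape: 32 hex characters, a hyphen, then a suffix such as "us6".
KEY_PATTERN = re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b")


def find_candidate_keys(page_source):
    """Return the unique key-shaped strings found in a page's HTML or JavaScript."""
    return sorted(set(KEY_PATTERN.findall(page_source)))
```

A pattern match only yields candidates. Confirming that a key is 'live' would need a separate verification step, which is left out here.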
The report warned that this exposure of LLMs to examples of code containing hardcoded secrets could lead to them suggesting these secrets in their model outputs, although it noted fine-tuning, alignment techniques, prompt context, and alternative training data can mitigate this risk.
Nonetheless, malicious actors could use the keys for phishing campaigns, data exfiltration, and brand impersonation, researchers said.
Industry on an “unsustainable path for growing infrastructure complexity”
IT leaders have warned that an increasingly complex technology landscape, combined with an ever expanding number of machine identities for organizations to manage, is a major factor in why these secrets have been exposed.
As developers struggle to manage complex machine identities, human errors like hardcoding secrets become much more common, and those secrets then turn up in AI training data scraped by web crawlers, as in the Common Crawl case.
Speaking to ITPro, Darren Meyer, security research advocate at Checkmarx, suggested this problem has been around for a while and is only set to get worse as organizations increase the number of machine identities they need to manage by adopting new technologies.
“This problem of leaking credentials and related secrets because of machine-to-machine authentication requirements is a long-standing and growing issue,” he said.
“New use cases like training AI models on otherwise private data will definitely increase the likelihood that secrets leak, as well as the impact of those leaks.”
Ev Kontsevoy, CEO at Teleport, added he was not surprised by these findings and that at the current rate the industry is on an “unsustainable path for growing infrastructure complexity”. Kontsevoy further warned that this will continue to happen unless the industry changes its understanding of identity.
“It’s never surprising seeing that secrets like API keys are making their way into places they shouldn’t be. We are on an unsustainable path for growing infrastructure complexity that will continuously expose secrets and waste the productivity of engineers, unless we rethink our approach to identity and security,” he argued.
“Every emerging technology being brought into production is on one hand critical for businesses to stay competitive – because your competitors are adopting that tech as well – but on the other, it represents yet another attack vector,” Kontsevoy added.
“Every single layer of a technology listening on the network has its own idea of users, its own role-based access control, its own configuration and configuration syntax. That requires expertise, which most teams today lack to secure every little thing they have, and yet the future keeps bringing new things they need to secure.”
Organizations have their work cut out for them
Meyer said addressing this problem will not be easy and organizations have two relatively stark challenges ahead of them to avoid exposing their secrets, whether through AI training data or otherwise.
“Organizations need to do two relatively challenging things. Firstly, organisations should be seeking to avoid using long-life secrets for machine-to-machine authentication, replacing such systems with OIDC or other similar systems that use short-lived tokens wherever possible,” he told ITPro.
“This reduces the impact of a secrets leak, as the leaked secrets are much more likely to have expired by the time an attacker gets hold of them, making them useless.”
“Secondly, they should have strong processes around AI adoption to ensure that AI agents and related systems don’t have access to sensitive data in most cases. This type of control has to happen at every stage, from alerting about secrets being leaked during development to carefully monitoring the data being fed into AI models during training and operation.”
He added that AI agents that require varying levels of access to potentially sensitive areas of their IT environment will introduce further identity management challenges.
“In cases where the purpose of the AI agent requires access to secrets or other sensitive data, those adoption processes should ensure that access to the model and any implementing applications is tightly controlled.”
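Meyer's first recommendation, replacing long-life secrets with short-lived tokens, can be illustrated with a small Python sketch. This is not OIDC and not a production design: it is a standard-library HMAC example that only shows why an expiry time limits the damage of a leak. The signing key, the five-minute lifetime, and the function names are all assumptions made for the example.

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"demo-signing-key"  # hypothetical; a real service would load this securely
TOKEN_LIFETIME_SECONDS = 300  # assumed lifetime of five minutes


def _sign(payload):
    """Return a hex HMAC-SHA256 signature over the payload."""
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(client_id, now=None):
    """Issue a signed token of the form '<client_id>.<expiry>.<signature>'."""
    issued = int(time.time() if now is None else now)
    payload = f"{client_id}.{issued + TOKEN_LIFETIME_SECONDS}"
    return f"{payload}.{_sign(payload)}"


def is_token_valid(token, now=None):
    """Accept a token only if its signature matches and it has not expired."""
    current = time.time() if now is None else now
    try:
        payload, signature = token.rsplit(".", 1)
        _client_id, expires = payload.rsplit(".", 1)
        expires_at = int(expires)
    except ValueError:
        return False  # malformed token
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return False  # tampered or signed with a different key
    return current < expires_at
```

A token issued this way stops working five minutes later, so a copy that ends up in a web archive months afterwards would already be useless to an attacker.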
Solomon Klappholz is a former Staff Writer at ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in IT regulation, industrial infrastructure applications, and machine learning.