Enterprises beware, your LLM servers could be exposing sensitive data

Abstract technology image of running program code on digital data wave.

(Image credit: Getty Images)

published 2 September 2024

Publicly accessible AI platforms may be exposing your corporate data on the internet, new research warns.

Legit Security recently published an investigation into security issues affecting the infrastructure underpinning many businesses’ AI applications, suggesting these systems could also be susceptible to data leakage and data poisoning.

The research highlighted risks associated with two popular types of publicly accessible AI services: vector databases and LLM tools.

Vector databases used to store unstructured data for AI applications, allowing the systems to search the database according to similarity instead of exact matches.

Researchers found these databases often lack basic security guardrails, highlighting a number of cases where they found publicly accessible instances allowing anonymous access with no permission enforcement.

This meant that anyone with network access to the server would be able to read sensitive data inside, including metadata, as well as the embeddings – the numerical representations of words, images, or videos used by LLMs.

These embeddings could be used by attackers to reverse engineer the transformer used by the model to recover input data, according to research from the Hong Kong University of Science and Technology.

These systems are also vulnerable to data poisoning attacks, the report noted, whereby attackers alter databases so it changes the behavior of the AI applications built on that dataset.

Legit Security offered a number of examples of how this attack could unfold. For example, hackers could modify a vector database and make a client-facing chatbot instruct customers to download and install malware on their devices.

Another example suggested chatbots used for medical consultations with access to historical patient data could be manipulated into providing false or dangerous advice.

In addition, researchers warned the software installed on servers hosting these vector databases contain vulnerabilities that attackers could potentially exploit to exfiltrate or poison the data they contain.

Almost half of scanned Flowise servers vulnerable to “simple authentication bypass vulnerability”

Legit Security’s investigation used scanning tools to identify a number of publicly accessible vector database instances, checking for required authentication and whether it was possible to extract data from the system.

The researchers found around 30 servers with evidence of corporate or private data, such as private company emails, customer PII, financial records, and prospect resumes and contact information.

It found some of these servers were susceptible to data poisoning, including one storing patient information for a medical chatbot, company Q&A data, and a real estate agency’s property data.

In each of these cases, the attackers would not have needed to exploit a vulnerability or use specific tools to read the data, and were able to use the REST-API or Web UI to modify or delete data.

RELATED WHITEPAPER

Ensure that applications run with optimal performance

In addition to vector databases, the report found publicly exposed LLM tools suffered from a similar lack of security layers, highlighting one particular tool – Flowise – a low-code LLM automation service.

Tools like Flowise have access to a wide range of sensitive data including private company information , application configurations, and prompts.

Many businesses integrate these tools with external services like the OpenAI API, AWS Bedrock, Confluence, or GitHub, the report noted, meaning any credential leakage related to the integrations could lead to an even wider breach upstream.

Researchers scanned for public instances of Flowise servers and found the majority were password-protected, but warned it found a number of “simple vulnerabilities” in early versions of the platform.

For example, 45% of the 959 servers assessed by the researchers were found to be vulnerable to an authentication bypass vulnerability (CVE-2024-31621).

Moreover, scanning the data in the servers it found a “couple dozen secrets”, including OpenAI API keys, GitHub access tokens, URLs with database passwords, as well as API keys for Pinecone vector databases.

Businesses should reevaluate what platforms their developers are using, Legit Security advised, and should immediately implement a strict permissions system to prevent anonymous access.

“If possible, do not publicly expose these services, and manage access through private networks,” the report added.

Another precaution the report suggested is to ensure that any client PII and other sensitive information are removed from the data used by their AI services, to avoid potentially costly data leakage.

Solomon Klappholz is a former staff writer for ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in cybersecurity, IT regulation, industrial infrastructure applications, and machine learning.