Solving the data dilemma: Balancing AI innovation with ethics

The relationship between websites and AI companies is fraught with conflict. Disney launched a legal battle with AI company Midjourney in 2025, and the BBC threatened AI firms with legal action.

This ongoing friction is likely to change the way we use the internet, making it more difficult to access information as sites lock away content. However, access to public web data is important. It allows companies globally to develop tools for consumers to easily compare prices, for example.

Public web data is also vital for channel partners who use it to optimize marketing campaigns, verify ads, and track competitors' price fluctuations, among other use cases. As a result, we need to ensure public web data stays just that: public and equally accessible for innovation, so that the open internet can continue.

An outline of web scraping practices

Multiple industries, such as AI, e-commerce, marketing, finance, and cybersecurity, need public web intelligence. They use proxy IPs and public data collection solutions to access it.

With these tools, businesses can compete to offer the lowest prices for consumers. Meanwhile, cybersecurity experts use proxies to collect threat intelligence only accessible from specific locations. Many universities and NGOs use proxies for their research and to track propaganda or disinformation.

For years, e-commerce companies have used web intelligence to compare their product prices against those of competitors. They also track price and inventory changes ahead of the holiday season or a specific promotion. For this reason, closing off public data access would be detrimental to channel partners who depend on intelligence gathered from the web.

The role of public data in Google search

Web scraping is almost as old as the internet itself, with Google being the biggest and best-known scraper. Originally, Google made the internet usable. But, as the internet has evolved, the way it interacts with public data has changed. In the age of AI, Google visits every new website, clicks on each link, gathers all available information, and stores this in its vast data centers.

The data is then processed and indexed so that whenever you need to search for something, you simply enter specific keywords and Google will display the top websites with the desired content.
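To make the indexing step concrete, here is a minimal sketch of a keyword index in Python. It is an illustration only, not a description of how Google's systems actually work: the page URLs, the page text, and the function names `build_index` and `search` are all hypothetical, and real search engines add ranking, language processing, and much more.

```python
from collections import defaultdict

# Hypothetical mini-corpus: URL -> page text (illustrative only).
pages = {
    "https://example.com/shoes": "running shoes on sale",
    "https://example.com/boots": "leather boots and hiking shoes",
    "https://example.com/hats": "summer hats on sale",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the URL sets of all keywords; unknown words match nothing.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(pages)
print(search(index, "shoes sale"))
```

The point of the sketch is that the pages are read once, up front, so a later keyword search only has to look up the stored index rather than revisit every site.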

Google can do this not only because it has previously visited these websites and gathered the content, but also because the owners of those websites are happy with this process. They want Google to list them, as it increases the chances of new visitors clicking on their website.

The rise of anti-scraping measures in the age of AI

In recent years, a new player has entered the market with huge resources, impacting the whole industry and disrupting the data ecosystem. This new player is, of course, AI. According to live polls during OxyCon sessions, 57% of respondents reported that public web-scraped data remains their main source for training AI models.

The sudden surge of AI development has led to complicated legal battles on both sides over training data, and there are still no clear rules for compliance. This is evident in the legal battle X took on recently. X did not want its data used for LLM training and, as a result, it started blocking traffic and using its lawyers to deter organizations from gathering web intelligence from its site.

However, since then, it has lost two legal cases in which the judge reasoned that content on X generated by users and accessible without login is, in actuality, public data, and therefore does not belong exclusively to X.

These legal battles are occurring because the landscape is unclear and difficult to navigate. Unsurprisingly, Europe is putting regulations in place the fastest; however, this has not provided a solution, and many consider Europe’s recent regulations to be unclear and too strict. Truthfully, no one in the region has a clear, confident understanding of how to comply.

Meanwhile, the US has passed 280-plus pieces of legislation in the past 12 months, while Australia is considering an entirely new approach with a focus on innovation. While AI is developing fast, it's not apparent what needs to be regulated first. Naturally, by the time a new regulation is in place, AI will have changed so much that it's difficult for the rules to catch up and stay relevant.

The end of commerce as we know it

At this point, it may seem like AI is going to lead every website to block each other to keep hold of their data. However, the huge popularity of AI suggests that this is not the best route, especially as more and more users are opting to use ChatGPT as a form of ‘search engine’ before making final purchase decisions.

Once someone has used an LLM as a product research or price comparison tool, they are far more likely (four times by some estimates) to make their purchase, because the research is done. As an enterprise, if you block AI agents from scraping data from your website, your product will not be reflected in the LLM results given to consumers, and, unsurprisingly, sales will plummet.
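The article does not say how a site would block AI agents. One common, voluntary convention is a robots.txt file, and the sketch below assumes that mechanism purely for illustration. It uses Python's standard `urllib.robotparser` to show how a blanket rule against one crawler hides every product page from it while other crawlers stay welcome. The agent names `ExampleAIBot` and `ExampleSearchBot`, the shop URL, and the helper `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is shut out, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Report whether the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


product_url = "https://shop.example/products/widget"  # hypothetical page
for agent in ("ExampleAIBot", "ExampleSearchBot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, product_url))
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it, which is part of why sites also turn to traffic blocking and legal action.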

Web intelligence is critical for the whole digital ecosystem, allowing the building of automations and solutions with consumers in mind. However, the ongoing data wars halt innovation and cause some actors to be excluded. In this scenario, it's crucial to find a way to access data without damaging the equal opportunities of others.

We’re living in truly fascinating times where technology that was created 10 years ago feels ancient now. If you’re playing by the rules, open data access helps businesses to create innovative solutions. For this reason, it’s important that public web data stays open and equally available for all to utilize.

Vaidotas is the chief risk officer at Oxylabs, a market-leading web intelligence collection platform.

Having over 10 years of experience in payment and digital risk management, Vaidotas has established himself as an influential force in the web data gathering industry, employing innovative methods to ensure the most ethical and secure SaaS business processes.

Before coming to Oxylabs, Vaidotas spent seven years at Western Union, working as a risk analyst and, later, leading digital risks and digital payments teams.

Currently, Vaidotas is leading a team of 17 professionals that is successfully overseeing risk-vulnerable areas of business operations and countering emerging threats.