Apple: Our AI data was gathered responsibly
Apple is keen to quell concerns over the source of the data used to build its AI models
Apple has laid out how it trained and developed the AI models powering the Apple Intelligence tools set to arrive with iOS 18, stressing the work was done responsibly amid complaints that data has been snatched without permission to train generative AI models.
Leading AI developers have been accused of dodgy tactics when it comes to harvesting data for developing and training large language models (LLMs), prompting a series of lawsuits from newspapers and authors targeting OpenAI and other companies for copyright infringement.
Last month, Apple announced it would be using its own AI models alongside OpenAI's technology to power generative AI tools in the next version of its mobile operating system, iOS 18, due out this autumn.
Apple previewed the system at its 2024 Worldwide Developers Conference (WWDC), with the first hands-on look arriving in a developer beta of iOS 18.1, suggesting the AI tools may not be available when iOS 18 itself first ships.
Apple’s responsible AI work
Alongside that beta, Apple released a report detailing how two of its foundation language models for Apple Intelligence work and, beyond that, laying out the case for why its AI is responsible with regard to data use. The two models in question, AFM-on-device and AFM-server (AFM stands for Apple Foundation Model), are part of a larger suite of generative models created by Apple.
"Apple Intelligence is designed with Apple’s core values at every step and built on a foundation of industry-lead privacy protection," the document states. "Additionally, we have created Responsible AI principles to guide how we develop AI tools, as well as the models that underpin them."
Those principles guide Apple to identify where AI can be used responsibly to help users; avoid bias; take care to avoid harm at all stages of design and development; and protect user privacy.
"Apple Intelligence is developed responsibly and designed with care to empower our users, represent them authentically, and protect their privacy," Apple said via the report.
In practice, that means Apple methodically tests its AI against a list of potential harms, such as hate speech and graphic violence, across model training, guardrail development, evaluation, and data collection, the company said.
Where's the data from?
Apple said AFM pre-training data came from "a diverse and high quality data mixture" — and includes no private Apple user data.
"This includes data we have licensed from publishers, curated publicly available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot," the report said. "We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives."
That comes as AI developers including Anthropic have been accused of scraping websites for training content despite instructions in a site's robots.txt file, the same mechanism Apple references, asking them not to. While the practice isn't necessarily illegal, it can breach copyright and impose disruption and significant costs on websites.
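For context, the opt-out Apple refers to is a plain-text robots.txt file served from a site's root. A minimal, illustrative example that blocks Apple's crawler entirely might look like the snippet below; the "Applebot" user-agent token is the one Apple documents for its crawler, while the blanket disallow rule is simply an example, as publishers can scope rules to particular paths instead.

    User-agent: Applebot
    Disallow: /

Sites that want to stay visible to Apple's search features but limit what gets crawled can narrow the Disallow rules rather than blocking the crawler outright.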
Meanwhile, Apple's reference to publicly available datasets could include "The Pile": Apple is one of several companies accused of using YouTube video subtitles in training after reportedly using that publicly available research dataset to train its OpenELM model. That suggests that even with such responsible data-gathering principles in place, data from less-than-suitable sources could still end up in Apple's models.
Beyond text, Apple pulls in code sourced from GitHub, stressing it uses only code data with licenses that allow such reuse — clearly attempting to avoid a lawsuit like that faced by Microsoft and GitHub over the use of open-source software.
Synthetic data is also used in the post-training phase, in particular for coding, math, and tool use.
Apple added that it cleans up all data, regardless of the source, whether public datasets or material licensed from publishers: "extensive efforts have been made to exclude profanity, unsafe material, and personally identifiable information."
Is Apple the ethical AI alternative?
It's clear that Apple is attempting to set out its stall as the responsible AI provider, just as it has done with privacy on mobile.
That could be a wise move. Research suggests widespread distrust of generative AI and of the companies building such systems. Plus, the ongoing legal wrangling over training data could prove a challenge to rival systems, especially if models need to be reworked in the future to avoid using specific datasets.
The ethical effort could pay off with better results in the long run, if the tech axiom "garbage in, garbage out" proves true with LLMs.
"We find that data quality, much more so than quantity, is the key determining factor of downstream model performance," Apple said in the report.