IBM unveils world-first machine learning training method for GDPR-compliance
The novel approach to training machine learning models minimises the amount of personal data that's required to make accurate predictions


IBM researchers have unveiled a novel method of training machine learning (ML) models that minimises the amount of personal data required and preserves high levels of accuracy.
The research is thought to be a boon to businesses that need to stay compliant with data protection and data privacy laws such as the General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA).
In both GDPR and CPRA, 'data minimisation' is a core component of the legislation but it's been difficult for companies to determine what the minimal amount of personal data should be when training ML models.
It's especially difficult when the goal of training ML models is usually to achieve the highest degree of accuracy in predictions or classifications, regardless of the amount of data used.
The findings from the study, thought to be a world-first development in the field of machine learning, showed that training datasets could contain less personal data after undergoing a process of generalisation while preserving the same level of accuracy as larger, ungeneralised ones.
At no point did researchers see a drop in prediction accuracy below 33% even when the entire dataset was generalised, preserving none of the original data. In some cases, the researchers were able to achieve 100% accuracy even with some generalisation.
In addition to adhering to the data minimisation principle of major data protection laws, researchers suggest that smaller data requirements could also lead to reduced costs in areas like data storage and management fees.
Data generalisation process
Businesses can become more compliant with data laws by removing or generalising some of the input features of runtime data, IBM researchers showed.
Generalisation involves taking a feature value and breaking it down into specific values and generalised values. For a numerical feature 'age', the specific values of which could be 37 or 39, a possible generalised value range could be 36-40.
A categorical feature of 'marital status' could have the specific values 'married', 'never married', and 'divorced'. A generalisation of these could be 'never married' and 'divorced', which eliminates one value and decreases specificity but still provides a degree of accuracy, as 'divorced' implies that an individual has, at one point, been married.
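The two examples above can be sketched in a few lines of Python. This is only an illustration, not IBM's code: the function names, the fixed bin width of five and the group label used for marital status are all assumptions made here.

```python
# Illustrative sketch only: names, bin width and group labels are assumptions.

def generalise_age(age: int, width: int = 5) -> str:
    """Map an exact age to a fixed-width range label, e.g. 37 -> '36-40'."""
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

# Hypothetical grouping: 'married' is folded into the same generalised
# group as 'divorced', leaving two groups instead of three values.
MARITAL_GROUPS = {
    "married": "divorced (has been married)",
    "divorced": "divorced (has been married)",
    "never married": "never married",
}

def generalise_marital_status(status: str) -> str:
    """Replace a specific marital status with its generalised group."""
    return MARITAL_GROUPS[status]

if __name__ == "__main__":
    print(generalise_age(37), generalise_age(39))
    print(generalise_marital_status("married"))
```

Both 37 and 39 fall into the same 36-40 range, so a model that only sees the range can no longer tell the two individuals apart by age.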
The generalised numerical feature is less specific, as the range 36-40 covers three values beyond the original 37 and 39, while the generalised categorical feature is less detailed. The quality of these generalisations is then analysed using a metric. IBM chose to use the NCP metric over others in consideration as it lent itself best to the purposes of data privacy.
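Assuming NCP here refers to the Normalised Certainty Penalty, a common information-loss measure in the anonymisation literature, it can be sketched as follows. For a numerical feature it is the width of the generalised range divided by the width of the feature's whole domain; for a categorical feature it is the size of the generalised group divided by the number of distinct values, or zero when the value is left ungeneralised. The function names and the 0-100 age domain below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: assumes NCP means Normalised Certainty Penalty.

def ncp_numeric(low: float, high: float, domain_min: float, domain_max: float) -> float:
    """Information loss of a numeric range: 0 is exact, 1 is the full domain."""
    return (high - low) / (domain_max - domain_min)

def ncp_categorical(group_size: int, domain_size: int) -> float:
    """Information loss of a categorical group; an ungeneralised value scores 0."""
    if group_size <= 1:
        return 0.0
    return group_size / domain_size

if __name__ == "__main__":
    # Age generalised to 36-40, assuming a hypothetical age domain of 0-100.
    print(ncp_numeric(36, 40, 0, 100))
    # Two of three marital-status values merged into one group.
    print(ncp_categorical(2, 3))
```

A lower score means less information was lost, so a score closer to 1 indicates a more aggressive generalisation.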
Researchers then selected a dataset and trained one or more target models on it to create a baseline. Generalisation was then applied, the accuracy was calculated and re-calculated (see diagram above) until the final generalisation was ready to be compared to the baseline.
The accuracy of the target model is calculated using decision trees (see above) which are gradually trimmed from the bottom upwards, taking note of any significant decreases in accuracy.
If accuracy is maintained or meets the acceptable threshold after the generalised data is applied, the researchers then work to improve the generalisation. They gradually trim the decision tree from the bottom upwards, increasing the generalised range of a given feature at each step, until the final optimised generalisation is reached.
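The loop described above can be illustrated with a heavily simplified stand-in. Rather than pruning a decision tree, the sketch below widens a fixed-width age bin step by step and stops as soon as accuracy falls more than an allowed amount below the baseline. The toy model, the sample ages, the candidate widths and the `max_drop` threshold are all assumptions made for this example, not details from the IBM study.

```python
# Illustrative sketch only: a toy stand-in for the iterative process,
# using bin widening instead of decision-tree pruning.

def target_model(age: float) -> int:
    """Hypothetical stand-in for a trained model: predicts 1 for ages 38 and over."""
    return 1 if age >= 38 else 0

def generalise(age: int, width: int) -> float:
    """Replace an age with the midpoint of its width-sized bin."""
    low = (age // width) * width
    return low + (width - 1) / 2

def accuracy(ages, labels, width):
    """Accuracy of the target model when it only sees generalised ages."""
    preds = [target_model(generalise(a, width)) for a in ages]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def widest_acceptable_width(ages, labels, widths, max_drop):
    """Widen the bins step by step; stop once accuracy drops too far."""
    baseline = accuracy(ages, labels, 1)  # width 1 leaves the data unchanged
    best = 1
    for width in sorted(widths):
        if baseline - accuracy(ages, labels, width) <= max_drop:
            best = width
        else:
            break
    return best

AGES = [22, 25, 31, 37, 39, 41, 44, 52, 58, 63]
LABELS = [target_model(a) for a in AGES]  # baseline accuracy is 1.0 by construction

if __name__ == "__main__":
    print(widest_acceptable_width(AGES, LABELS, [2, 5, 10, 20, 30], max_drop=0.15))
```

A stricter threshold yields a narrower, more specific generalisation, which mirrors the trade-off between accuracy and data minimisation that the researchers describe.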

Connor Jones has been at the forefront of global cyber security news coverage for the past few years, breaking developments on major stories such as LockBit’s ransomware attack on Royal Mail International, and many others. He has also made sporadic appearances on the ITPro Podcast discussing topics from home desk setups all the way to hacking systems using prosthetic limbs. He has a master’s degree in Magazine Journalism from the University of Sheffield, and has previously written for the likes of Red Bull Esports and UNILAD tech during his career that started in 2015.