Computer scientists make noisy data: it can improve treatments in health care
University of Copenhagen researchers have developed software able to disguise sensitive data such as those used for Machine Learning in health care applications. The method protects privacy while making datasets available for development of better treatments.

A key element in modern healthcare is collecting and analyzing data from large groups of patients to discover patterns. Which patients benefit from a given treatment? And which patients are likely to experience side effects? Such data must be protected, or the privacy of individuals is violated. Furthermore, breaches harm general trust, leading fewer people to give their consent to take part. Researchers at the Department of Computer Science, University of Copenhagen, have found a clever solution.
“We have seen several cases in which data was anonymized and then released to the public, and yet researchers managed to retrieve the identities of participants. Since many other sources of information exist in the public domain, an adversary with a good computer will often be able to deduce the identities even without names or personal ID numbers. We have developed a practical and economical way to protect datasets when they are used to train Machine Learning models,” says PhD student Joel Daniel Andersson.
The level of interest in the new algorithm is illustrated by the fact that Joel Daniel Andersson was invited to give a Google Tech Talk on it, one of the most prestigious digital formats for computer science research. He also recently presented the work at NeurIPS, one of the world’s leading conferences on Machine Learning, with more than 10,000 participants.
Deliberately polluting your output
The key idea is to mask your dataset by adding “noise” to any output derived from it. Unlike encryption, where the distortion is later removed, here the noise stays: once added, it cannot be distinguished from the “true” output.
Understandably, the owner of a dataset will not be happy about noising outputs derived from it.
“A lower utility of the dataset is the necessary price you pay for ensuring the privacy of participants,” says Joel Daniel Andersson.
The key task is to add an amount of noise sufficient to hide the original data points, but still maintain the fundamental value of the dataset, he notes:
“If the output is sufficiently noisy, then it becomes impossible to infer the value of an individual data point in the input, even if you know every other data point. By noising the output, we are in effect adding safety rails to the interaction between the analyst and the dataset. The analysts never access the raw data; they only ask queries about it and get noisy answers. Thereby, they never learn any information about individuals in the dataset. This protects against information leaks, inadvertent or otherwise, stemming from analysis of the data.”
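The interaction described in the quote can be sketched with the standard Laplace mechanism from the differential-privacy literature. This is a generic illustration, not the group's specific algorithm; the `private_count` helper and the example data are hypothetical:

```python
import math
import random

def laplace_sample(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(data, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1: adding or removing one person
    # changes the true count by at most 1, so Laplace(1/epsilon) noise
    # makes the released answer epsilon-differentially private.
    true_count = sum(1 for record in data if predicate(record))
    return true_count + laplace_sample(1.0 / epsilon)

# The analyst only ever sees the noisy answer, never the raw records.
ages = [34, 52, 41, 67, 29, 73, 58]
noisy_answer = private_count(ages, lambda a: a >= 65, epsilon=1.0)
```

The analyst learns roughly how many participants are 65 or older, but cannot tell whether any specific individual is in the dataset.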
Privacy comes with a price tag
There is no universal optimal trade-off, Joel Daniel Andersson underscores:
“You can pick the trade-off that fits your purpose. For applications where privacy is highly critical – for instance healthcare data – you can choose a very high level of privacy. This means adding a large amount of noise. Notably, this will sometimes mean that you need to increase your number of data points – for instance by including more persons in your survey – to maintain the value of your dataset. In applications where privacy is less critical, you can choose a lower level. Thereby, you maintain the utility of your dataset and reduce the costs involved in providing privacy.”
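The point about dataset size can be made concrete with a bounded mean query under the Laplace mechanism: one person's record shifts the true mean by at most range/n, so at a fixed privacy level the required noise shrinks as the survey grows. A minimal sketch with illustrative numbers:

```python
def laplace_scale_for_mean(n: int, value_range: float, epsilon: float) -> float:
    # The sensitivity of a mean over n values bounded in an interval of
    # width value_range is value_range / n. The Laplace mechanism then
    # adds noise with scale sensitivity / epsilon, so the typical error
    # falls linearly as the survey grows.
    sensitivity = value_range / n
    return sensitivity / epsilon

# Same privacy level (epsilon = 0.5), growing survey sizes:
scales = {n: laplace_scale_for_mean(n, value_range=100.0, epsilon=0.5)
          for n in (100, 1_000, 10_000)}
```

Tightening privacy (a smaller epsilon) raises every scale proportionally, which is exactly the cost the quote describes; adding participants buys that cost back.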
Reducing costs is exactly the prime argument behind the method developed by the research group, he adds:
“The crux is how much noise you must add to achieve a given level of privacy, and this is where our smooth mechanism offers an improvement over existing methods. We manage to add less noise and do so with fewer computational resources. In short, we reduce the costs associated with providing privacy.”
Huge interest from industry
Machine Learning involves large datasets. For instance, in many healthcare disciplines a computer can find patterns that human experts cannot see. This all starts with training the computer on a dataset with real patient cases. Such training sets must be protected.
“Many disciplines depend increasingly on Machine Learning. Further, we see Machine Learning spreading beyond professionals like medical doctors to various private applications. These developments open a wealth of new opportunities but also increase the need for protecting the privacy of the participants who provided the original data,” explains Joel Daniel Andersson, noting that interest in the group’s new software is far from just academic:
“Besides the healthcare sector and large tech companies such as Google, industries like consultancies, auditing firms, and law firms need to be able to protect the privacy of their clients and of participants in surveys.”
Public regulation is called for
The field is known as differential privacy. The term derives from the fact that the privacy guarantee covers datasets differing in a single data point: outputs based on two datasets that differ in only one data point will look similar. This makes it impossible for the analyst to identify any single data point.
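The guarantee can be checked numerically for the standard Laplace mechanism: for every possible output, the probability density under two neighbouring datasets differs by at most a factor of e^epsilon. A small illustrative check (the counts are made up):

```python
import math

def laplace_pdf(x: float, mu: float, scale: float) -> float:
    # Density of the Laplace(mu, scale) distribution at x.
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

epsilon = 1.0
scale = 1.0 / epsilon                 # counting query, sensitivity 1
count_d, count_d_neighbor = 42, 43    # datasets differing in one person

# For every candidate output x, the density ratio stays within
# e^epsilon: this bounded ratio IS the differential-privacy guarantee,
# and it is what makes the two datasets' outputs "look similar".
worst_ratio = max(
    laplace_pdf(x, count_d, scale) / laplace_pdf(x, count_d_neighbor, scale)
    for x in (40.0 + 0.1 * k for k in range(60))
)
```

Because no output is much likelier under one dataset than the other, seeing the noisy answer tells the analyst almost nothing about whether the one differing person participated.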
The research group advocates for public bodies to take a larger interest in the field.
“Since better privacy protection comes with a higher price tag due to the loss of utility, it easily becomes a race to the bottom for market actors. Regulation should be in place, stating that a given sensitive application needs a certain minimum level of privacy. This is the real beauty of differential privacy. You can pick the level of privacy you need, and the framework will tell you exactly how much noise you will need to achieve that level,” says Joel Daniel Andersson. He hopes that differential privacy may serve to facilitate the use of Machine Learning:
“If we again take medical surveys as an example, they require patients to give consent to participate. For various reasons, you will always have some patients refusing – or simply forgetting – to give consent, lowering the value of the dataset. But since it is possible to provide a strong probabilistic guarantee that the privacy of participants will not be violated, it could be morally defensible not to require consent and thereby achieve 100% participation, to the benefit of medical research. If the increase in participation is large enough, the loss in utility from providing privacy could be more than offset by the increased utility from the additional data. As such, differential privacy could become a win-win for society.”
The new method is presented in the scientific article “A Smooth Binary Mechanism for Efficient Private Continual Observation.”
Contact:
Joel Daniel Andersson
PhD Student
Department of Computer Science (DIKU)
University of Copenhagen
jda@di.ku.dk
+46 73 08 72 712.
Michael Skov Jensen
Journalist and team coordinator
The Faculty of Science
University of Copenhagen
Mobile: + 45 93 56 58 97
msj@science.ku.dk