Computer scientists make noisy data: Can improve treatments in health care
University of Copenhagen researchers have developed software that can disguise sensitive data, such as the data used for Machine Learning in health care applications. The method protects privacy while keeping datasets available for the development of better treatments.

A key element in modern healthcare is collecting and analyzing data from a large group of patients to discover patterns. Which patients benefit from a given treatment? And which patients are likely to experience side effects? Such data must be protected, or the privacy of individuals is compromised. Furthermore, breaches harm general trust, leading to fewer people giving their consent to take part. Researchers at the Department of Computer Science, University of Copenhagen, have found a clever solution.
“We have seen several cases in which data was anonymized and then released to the public, and yet researchers managed to retrieve the identities of participants. Since many other sources of information exist in the public domain, an adversary with a good computer will often be able to deduce the identities even without names or citizen codes. We have developed a practical and economical way to protect datasets when they are used to train Machine Learning models,” says PhD student Joel Daniel Andersson.
The level of interest in the new algorithm is illustrated by the fact that Joel was invited to give a Google Tech Talk on it, one of the world’s most prestigious digital formats for computer science research. He also recently gave a presentation at NeurIPS, one of the world’s leading conferences on Machine Learning, with more than 10,000 participants.
Deliberately polluting your output
The key idea is to mask your dataset by adding “noise” to any output derived from it. Unlike in encryption, where noise is added and later removed, here the noise stays. Once added, the noise cannot be separated from the “true” output.
Understandably, the owner of a dataset will not be thrilled about noising the outputs derived from it.
“A lower utility of the dataset is the necessary price you pay for ensuring the privacy of participants,” says Joel Daniel Andersson.
The key task is to add enough noise to hide the original data points while still maintaining the fundamental value of the dataset, he notes:
“If the output is sufficiently noisy, it becomes impossible to infer the value of an individual data point in the input, even if you know every other data point. By noising the output, we are in effect adding safety rails to the interaction between the analyst and the dataset. The analysts never access the raw data; they only ask queries about it and get noisy answers. Thereby, they never learn any information about individuals in the dataset. This protects against information leaks, inadvertent or otherwise, stemming from analysis of the data.”
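To make the query-and-noisy-answer idea concrete, here is a minimal Python sketch of the textbook Laplace mechanism for a counting query, the simplest example of this kind of interaction. It is a generic illustration rather than the researchers' own algorithm; the function name and the toy patient records are invented for the example.

import numpy as np

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the true count by at most 1, so Laplace noise with scale
    # 1/epsilon is enough to hide any single individual's contribution.
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical query: how many patients responded to the treatment?
records = [{"responded": True}, {"responded": False}, {"responded": True}]
print(private_count(records, lambda r: r["responded"], epsilon=0.5))

The analyst only ever sees the noisy return value, never the underlying records, which is exactly the safety-rail interaction described above.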
Privacy comes with a price tag
There is no universal optimal trade-off, Joel Daniel Andersson underscores:
“You can pick the trade-off which fits your purpose. For applications where privacy is highly critical – for instance healthcare data – you can choose a very high level of privacy. This means adding a large amount of noise. Notably, this will sometimes imply that you need to increase your number of data points – for instance by including more persons in your survey – to maintain the value of your dataset. In applications where privacy is less critical, you can choose a lower level. Thereby, you maintain the utility of your dataset and reduce the costs involved in providing privacy.”
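Continuing the sketch above, the trade-off is visible directly in the noise scale: for a query of sensitivity 1, the Laplace mechanism uses scale 1/epsilon, so a stricter privacy level (a smaller epsilon) proportionally increases the typical error, which may then have to be offset by collecting more data points. The small snippet below is illustrative only.

# Stricter privacy (smaller epsilon) means proportionally larger noise.
for epsilon in (0.1, 1.0, 10.0):
    scale = 1.0 / epsilon  # Laplace scale; the standard deviation is sqrt(2) * scale
    print(f"epsilon = {epsilon:>4}: typical noise magnitude about {scale:.1f}")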
Reducing costs is exactly the prime argument behind the method developed by the research group, he adds:
“The crux is how much noise you must add to achieve a given level of privacy, and this is where our smooth mechanism offers an improvement over existing methods. We manage to add less noise and do so with fewer computational resources. In short, we reduce the costs associated with providing privacy.”
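The article referenced at the end of this text concerns private continual observation: releasing a running total over a stream of events while protecting every individual contribution. As background, the sketch below shows the classical binary-tree mechanism for that task, the kind of baseline a smooth binary mechanism is designed to improve on. It is not the new algorithm itself, and the function names and the toy stream are invented for illustration.

import numpy as np

def dyadic_intervals(t):
    # Decompose [1, t] into aligned dyadic intervals, largest first.
    # Example: t = 11 (binary 1011) gives [1, 8], [9, 10], [11, 11].
    intervals, start = [], 1
    for bit in range(t.bit_length() - 1, -1, -1):
        length = 1 << bit
        if t & length:
            intervals.append((start, start + length - 1))
            start += length
    return intervals

def binary_mechanism(stream, epsilon):
    # Release every running count of a 0/1 stream. Each element falls in at
    # most log2(T)+1 dyadic intervals, so adding Laplace noise with scale
    # levels/epsilon to each interval sum keeps the whole release private.
    T = len(stream)
    levels = T.bit_length()
    noisy_sums = {}                        # (start, end) -> noisy interval sum, reused
    outputs = []
    for t in range(1, T + 1):
        total = 0.0
        for (a, b) in dyadic_intervals(t):
            if (a, b) not in noisy_sums:
                true_sum = sum(stream[a - 1:b])
                noisy_sums[(a, b)] = true_sum + np.random.laplace(0.0, levels / epsilon)
            total += noisy_sums[(a, b)]
        outputs.append(total)              # noisy count of events seen so far
    return outputs

# Hypothetical stream: one event indicator per day.
print(binary_mechanism([1, 0, 1, 1, 0, 1, 1, 0], epsilon=1.0))

In this baseline, both the error per running count and the work needed to produce it grow only logarithmically with the length of the stream; the improvement described in the quote above is to achieve a given privacy level with less noise and fewer computational resources.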
Huge interest from industry
Machine Learning involves large datasets. For instance, in many healthcare disciplines a computer can find patterns that human experts cannot see. This all starts with training the computer on a dataset with real patient cases. Such training sets must be protected.
“Many disciplines depend increasingly on Machine Learning. Further, we see Machine Learning spreading beyond professionals like medical doctors to various private applications. These developments open a wealth of new opportunities, but also increase the need for protecting the privacy of the participants who provided the original data,” explains Joel Daniel Andersson, noting that interest in the group’s new software is far from just academic:
“Besides the healthcare sector and large tech companies such as Google, industries like consultancies, auditing firms, and law firms need to be able to protect the privacy of their clients and of participants in surveys.”
Public regulation is called for
The field is known as differential privacy. The term derives from the fact that the privacy guarantee concerns datasets differing in a single data point: output based on two datasets that differ in only one data point will look similar. This makes it impossible for the analyst to identify any single data point.
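In formal terms, and as the standard textbook definition rather than anything specific to this work, a randomized mechanism M is epsilon-differentially private if, for all datasets D and D' differing in a single data point and all sets S of possible outputs,

\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].

The smaller the privacy parameter \varepsilon, the more alike the two output distributions must be, and hence the more noise the mechanism has to add.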
The research group advocates for public bodies to take a larger interest in the field.
“Since better privacy protection comes with a higher price tag due to the loss of utility, it easily becomes a race to the bottom for market actors. Regulation should be in place, stating that a given sensitive application needs a certain minimum level of privacy. This is the real beauty of differential privacy. You can pick the level of privacy you need, and the framework will tell you exactly how much noise you will need to achieve that level,” says Joel Daniel Andersson. He hopes that differential privacy may serve to facilitate the use of Machine Learning:
“If we again take medical surveys as an example, they require patients to give consent to participate. For various reasons, you will always have some patients refusing – or just forgetting – to give consent, leading to a lower value of the dataset. But since it is possible to provide a strong probabilistic guarantee that the privacy of participants will not be violated, it could be morally defensible not to require consent and to achieve 100 % participation, to the benefit of medical research. If the increase in participation is large enough, the loss in utility from providing privacy could be more than offset by the increased utility from the additional data. As such, differential privacy could become a win-win for society.”
The scientific article presenting the new method, “A Smooth Binary Mechanism for Efficient Private Continual Observation”, can be found here.
Contact:
Joel Daniel Andersson
PhD Student
Department of Computer Science (DIKU)
University of Copenhagen
jda@di.ku.dk
+46 73 08 72 712.
Michael Skov Jensen
Journalist and team coordinator
The Faculty of Science
University of Copenhagen
Mobile: + 45 93 56 58 97
msj@science.ku.dk