Privacy

Research area: Statistics

Supervisors: James Liley (Michaelmas) and Louis Aslett (Epiphany)

Background

The collection of data is a pervasive aspect of contemporary life, ranging from personal devices to industrial processes. This data proliferation offers numerous advantages, such as enhanced predictive accuracy and improved decision-making capabilities, but it also introduces substantial privacy issues. Across domains like finance, education, and healthcare, sensitive personal information is collected, stored, and analysed, meaning the potential for misuse or unauthorised disclosure of this data can have extensive repercussions for individuals.

This is more than a mere theoretical concern: there have been numerous serious, real-world privacy breaches. The following are some illustrative examples, with links to further reading. First, it has been repeatedly shown that anonymisation alone is insufficient to protect privacy:

Massachusetts Health Data De-anonymisation
In 1997, Latanya Sweeney demonstrated that “anonymised” medical records released by the Massachusetts Group Insurance Commission could be de-anonymised by cross-referencing them with public voter registration lists. She famously identified the medical records of then-Governor William Weld by matching his birth date, sex, and ZIP code. This research suggested that a very large fraction of the US population could be uniquely identified by these three attributes alone. It helped to spark sustained academic interest in privacy, although the solution proposed in this work, k-anonymity, is no longer considered sufficient against modern attackers.

Academic paper: Sweeney (2002)

The Netflix Prize
From 2006 to 2009, Netflix held a competition to improve its recommendation algorithm, releasing an anonymised dataset of movie ratings. Researchers at the University of Texas successfully de-anonymised users by linking the Netflix data with public ratings on IMDb. This demonstrated that even sparse data can be identifying. The controversy later contributed to a privacy lawsuit by subscribers and to concerns from the FTC, after which Netflix abandoned a planned follow-up competition.

Academic preprint: Narayanan & Shmatikov (2007)

AOL Search Log Release
In 2006, AOL released a dataset of 20 million search queries from 650,000 users for academic research. Although user IDs were replaced with random numbers, the queries themselves contained highly personal information. The New York Times famously identified “User 4417749” as Thelma Arnold, a 62-year-old widow, by cross-referencing her searches for “men over 60” and local services in her town.

Article: The New York Times (2006)
Academic paper: Ohm (2010)

Military Base and Aircraft Carrier Locations
In 2018, it emerged that a global heatmap of user activity published by the fitness app Strava, intended to show popular running routes, had inadvertently revealed the locations and internal layouts of secret military bases in conflict zones such as Afghanistan. Soldiers using fitness trackers while patrolling or exercising on base produced bright traces on the heatmap that clearly outlined the bases' perimeters.

Source: The Guardian (2018)

A similar issue arose again in 2026, although this time it was due to an individual sharing their location rather than a company release. In that case, a French sailor recorded runs on the deck of an aircraft carrier, producing conspicuous traces up and down a small patch in the middle of the sea.

Source: BBC News (2026)

It may be tempting to think that anonymisation plus aggregation of data (i.e., releasing summary statistics) is automatically good enough, but even this can easily fail:

US Census Reconstruction Attack
In 2018, the US Census Bureau performed an internal “reconstruction attack” on the published summaries of the 2010 Census. By treating the billions of published aggregate statistics as a massive system of equations and solving it, they successfully reconstructed individual-level records (including age, sex, race, and ethnicity) for over 142 million people. This discovery was the primary driver for the Census Bureau's switch to differential privacy for the 2020 Census.

Journal Papers: Harvard Data Science Review, Special Issue 2, "Differential Privacy for the 2020 U.S. Census: Can We Make Data Both Private and Useful?"
Technical Report: Abowd et al. (2025)
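
To make the idea of a reconstruction attack concrete, the following is a toy sketch in Python, not the Census Bureau's method, which worked at a vastly larger scale. The four records, the age range of 20 to 40, and the function names summarise and reconstruct are all hypothetical choices made for illustration. The sketch publishes only aggregate statistics for a tiny “block” and then searches for every set of individual records consistent with them.

```python
from itertools import combinations_with_replacement
from statistics import median


def summarise(records):
    """Aggregate statistics of the kind an agency might publish for a block."""
    ages = [age for age, _ in records]
    stats = {
        "n": len(records),
        "min_age": min(ages),
        "max_age": max(ages),
        "median_age": median(ages),
    }
    for sex in ("F", "M"):
        group = [age for age, s in records if s == sex]
        stats["n_" + sex] = len(group)
        # Total age per group, equivalent to publishing the count and the mean.
        stats["sum_age_" + sex] = sum(group)
    return stats


def reconstruct(published, ages=range(20, 41), sexes=("F", "M")):
    """Return every multiset of (age, sex) records matching the published stats.

    Brute force over an assumed small domain; only feasible for toy examples.
    """
    domain = [(age, sex) for age in ages for sex in sexes]
    return [
        list(candidate)
        for candidate in combinations_with_replacement(domain, published["n"])
        if summarise(candidate) == published
    ]


# Hypothetical confidential records: (age, sex).
secret = [(31, "F"), (24, "F"), (38, "M"), (29, "M")]
published = summarise(secret)   # only this is released
print(reconstruct(published))   # the attacker sees only `published`
```

In this example the published statistics admit exactly one solution, so the attacker recovers all four records without ever seeing them. No names are involved at this stage; re-identification would require a further linkage step of the kind described in the earlier examples.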

Finally, people often do not give a second thought to releasing statistical and machine learning models that have been fitted to data. However, a model often encodes enough information about the data used to fit it to present a privacy risk. Worse still, even with access only to the predictions produced by a model, it can be possible to reverse-engineer sensitive information about the underlying data.

Target's Pregnancy Prediction
In 2012, the analytics team at the American discount retail chain Target developed a model that could identify pregnant shoppers based on shifts in their purchasing habits (e.g., switching to unscented lotion or buying large quantities of vitamins). The model was so accurate it famously revealed a teenager's pregnancy to her father before she had told him, by sending maternity-related coupons to their home.

Source: The New York Times Magazine (2012)

Genetic Marker Recovery
In 2014, researchers demonstrated that it was possible to recover sensitive genetic markers from a fitted linear model used for calculating Warfarin dosages. By using a small amount of demographic information (like age and weight), they could reverse-engineer the genetic data used to train the model, highlighting that even statistical models can leak private information.

Academic paper: Fredrikson et al. (2014)

Face Reconstruction
In 2015, academic researchers showed that repeated querying of a face-recognition API could be used to reconstruct images of the faces in the original training data. This model inversion attack showed that an adversary can “steal” private training data just by observing how the model responds to different inputs.

Academic paper: Fredrikson et al. (2015)

LLM Training Data Memorisation
In 2021, researchers demonstrated that large language models (like GPT-2) “memorise” snippets of their training data. By using specific prompts, they could extract sensitive personally identifiable information, private code, and even verbatim copies of copyrighted text. This training data extraction attack showed that high-capacity models can act as a database of their training data, leaking information they were never intended to reveal.

Academic paper: Carlini et al. (2021)

These examples show that privacy risks arise throughout the statistical modelling process, from data access for model fitting to the potential for reverse-engineering information from a fitted model. This project offers many directions for investigating privacy methods in statistics. Below are a few highlighted options, though motivated students are encouraged to chart their own paths through the relevant literature.

Group Project

A widely used framework for privacy analysis is differential privacy, which will therefore be the focus of the group phase of the project. Differential privacy quantifies the so-called “membership inference risk”. It assumes that some summary information (such as a fitted model) is released, and that an attacker possesses “side channel” information (additional information about individuals who may be in the data). We then quantify the risk that an adversary can determine whether a given individual's data was used in model fitting, which allows us to tune how we release the model so that our privacy requirements are met.

The key requirement for differential privacy is that the information released be randomised. For instance, to release the mean salary of all participants in a trial, we would compute the mean and then add random noise before release. Differential privacy tells us how much noise, and of what kind, is needed to protect individuals’ data.
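
As a minimal sketch of the mean salary example, the following Python code uses the Laplace mechanism, one standard way of adding noise in differential privacy. It rests on assumptions that are not part of the project description: salaries are clamped to a publicly known range [lower, upper], the number of participants is public, and the privacy parameter epsilon is chosen by the analyst. The function names and the salary figures are hypothetical, and the same logic carries over directly to R.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Release the mean of `values` with added Laplace noise.

    Assumes the number of records is public. After clamping to
    [lower, upper], changing one record moves the mean by at most
    (upper - lower) / n; the noise scale is that sensitivity over epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must be non-empty")
    if rng is None:
        rng = random.Random()
    clamped = [min(max(v, lower), upper) for v in values]
    n = len(clamped)
    sensitivity = (upper - lower) / n
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical trial salaries; the bounds and epsilon are illustrative choices.
salaries = [30_000, 42_000, 55_000, 61_000, 120_000]
release = private_mean(salaries, lower=0, upper=150_000, epsilon=1.0)
print(f"true mean: {sum(salaries) / len(salaries):.0f}, released: {release:.0f}")
```

With only five participants the noise scale here is 150,000 / 5 = 30,000, so the released value is of little use. With 5,000 participants and the same bounds and epsilon, the scale falls to 30. This trade-off between privacy, accuracy, and sample size is central to the theory.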

Differential privacy has its origins in the computer science community (Dwork et al., 2006) and is now a very active area of research, with a plethora of different definitions tailored to suit specific contexts. A good introductory text to the basics is Dwork and Roth (2014).
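
For reference, the basic definition from this literature can be stated as follows. A randomised mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in one individual's record and for every set S of possible outputs, the inequality below holds. The symbols M, D, D′, S and ε are notation introduced here for illustration.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

A smaller ε means that the distribution of the released output changes less when any one person's record is changed, which in turn limits what an attacker can infer about whether that person is in the data.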

By the end of the group project you will have learned:

By the end of the group project you will be able to:

Mode of operation and evidence of learning

The project will revolve around learning through reading with a focus on understanding both the privacy theory and a practical ability to deploy that theory in concrete privacy problems. Students will demonstrate their understanding by solving example privacy problems, writing R code to extend those solutions to actual data, and clearly communicating the material in both written and oral formats.

Individual Project

The individual project will build on the knowledge acquired in the group project and will explore additional advanced topics. Possible avenues for further study include:

Mode of operation and evidence of learning

The project will revolve around learning through reading with a focus on understanding both the privacy theory and a practical ability to deploy that theory in concrete privacy problems. Students will demonstrate their understanding by solving example privacy problems, writing R code to extend those solutions to actual data, and clearly communicating the material in both written and oral formats.

Summary

A project in privacy and confidentiality in statistics can be taken in many diverse and interesting directions; the options above are just some of them. It is also important to note that in practice we may combine these approaches, since no single method covers all aspects of a problem.

Additional Information

If you would like more information about this project, or to discuss its scope or prerequisites, don't hesitate to contact louis.aslett@durham.ac.uk and james.liley@durham.ac.uk.

Prerequisites and Co-requisites

Essential prerequisite modules are Statistical Inference II and Probability I, as well as the ability to program in R or another language with similar capabilities.

If you are interested in the cryptography option in the individual project, then Algebra II may be helpful. Since the cryptography topics can be entirely ignored, this is not an essential module. In particular, note that this is fundamentally a statistics project, so even if you choose to look at cryptographic elements it would not be to the same depth as, for example, a pure mathematics project on the same topic.

References

Abowd, J.M., Adams, T., Ashmead, R., Darais, D., Dey, S., Garfinkel, S.L., Goldschlag, N., Hawes, M.B., Kifer, D., Leclerc, P., Lew, E., Moore, S., Rodríguez, R.A., Tadros, R.N. and Vilhuber, L. (2025) A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census. Working Paper CES-25-57. US Census Bureau. Available at: https://www.census.gov/library/working-papers/2025/adrm/CES-WP-25-57.html

Aslett, L.J.M., Esperança, P. and Holmes, C.C. (2015a) ‘A review of homomorphic encryption and software tools for encrypted statistical machine learning’, arXiv, 1508.06574 [stat.ML]. DOI: 10.48550/arXiv.1508.06574

Aslett, L.J.M., Esperança, P. and Holmes, C.C. (2015b) ‘Encrypted statistical machine learning: new privacy preserving methods’, arXiv, 1508.06845 [stat.ML]. DOI: 10.48550/arXiv.1508.06845

Bun, M. and Steinke, T. (2016) ‘Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds’, in M. Hirt and A. Smith (eds) Theory of Cryptography. Berlin, Heidelberg: Springer, pp. 635–658. DOI: 10.1007/978-3-662-53641-4_24.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A. and Raffel, C. (2021) ‘Extracting Training Data from Large Language Models’, 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650. Available at: https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.

Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A. and Tramèr, F. (2022) ‘Membership Inference Attacks From First Principles’, 2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914. DOI: 10.1109/SP46214.2022.9833649

Dong, J., Roth, A. and Su, W.J. (2022) ‘Gaussian Differential Privacy’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1), pp. 3–37. DOI: 10.1111/rssb.12454.

Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) ‘Calibrating Noise to Sensitivity in Private Data Analysis’, in S. Halevi and T. Rabin (eds) Theory of Cryptography. Lecture Notes in Computer Science, 3876. Berlin, Heidelberg: Springer, pp. 265–284. DOI: 10.1007/11681878_14

Dwork, C. and Roth, A. (2014) ‘The Algorithmic Foundations of Differential Privacy’, Foundations and Trends in Theoretical Computer Science, 9(3–4), pp. 211–407. DOI: 10.1561/0400000042

Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D. and Ristenpart, T. (2014) ‘Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing’, 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32. Available at: https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew

Fredrikson, M., Jha, S. and Ristenpart, T. (2015) ‘Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures’, Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery (CCS ’15), pp. 1322–1333. DOI: 10.1145/2810103.2813677.

Gentry, C. (2010) ‘Computing arbitrary functions of encrypted data’, Communications of the ACM, 53(3), pp. 97–105. DOI: 10.1145/1666420.1666444

Hannun, A., Guo, C. and van der Maaten, L. (2021) ‘Measuring data leakage in machine-learning models with Fisher information’, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, 161, pp. 760–770.

Leemann, T., Pawelczyk, M. and Kasneci, G. (2023) ‘Gaussian Membership Inference Privacy’, Advances in Neural Information Processing Systems, 36, pp. 73866–73878.

Mironov, I. (2017) ‘Rényi Differential Privacy’, 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275. DOI: 10.1109/CSF.2017.11.

Narayanan, A. and Shmatikov, V. (2007) ‘How To Break Anonymity of the Netflix Prize Dataset’. arXiv, cs/0610105 [cs.CR]. DOI: 10.48550/arXiv.cs/0610105.

Ohm, P. (2010) ‘Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization’, UCLA Law Review, 57, pp. 1701–1777. Available at: https://heinonline.org/HOL/P?h=hein.journals/uclalr57&i=1713

Sweeney, L. (2002) ‘k-anonymity: A Model For Protecting Privacy’, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), pp. 557–570. DOI: 10.1142/S0218488502001648.