Privacy
Research area: Statistics
Supervisors: James Liley (Michaelmas) and Louis Aslett (Epiphany)
Background
The collection of data is a pervasive aspect of contemporary life, ranging from personal devices to industrial processes. This data proliferation offers numerous advantages, such as enhanced predictive accuracy and improved decision-making capabilities, but it also introduces substantial privacy issues. Across domains like finance, education, and healthcare, sensitive personal information is collected, stored, and analysed, meaning the potential for misuse or unauthorised disclosure of this data can have extensive repercussions for individuals.
This is more than a mere theoretical concern: there have been numerous serious, real-world privacy breaches. The following are some illustrative examples, with links to further reading. First, it has been repeatedly shown that anonymisation alone is insufficient to protect privacy:
Massachusetts Health Data De-anonymisation
Academic paper: Sweeney (2002)
The Netflix Prize
Academic preprint: Narayanan & Shmatikov (2007)
AOL Search Log Release
Article: The New York Times (2006)
Academic paper: Ohm (2010)
Military Base and Aircraft Carrier Locations
Source: The Guardian (2018)
A similar issue arose again in 2026, although this time it was due to an individual sharing their location rather than a company release. In that case, a French sailor recorded runs on the deck of an aircraft carrier, producing conspicuous traces up and down a small patch in the middle of the sea.
Source: BBC News (2026)
It may be tempting to think that anonymisation plus aggregation of data (i.e., releasing summary statistics) is automatically good enough, but even this can easily fail:
US Census Reconstruction Attack
Journal Papers: Harvard Data Science Review, Special Issue 2, "Differential Privacy for the 2020 U.S. Census: Can We Make Data Both Private and Useful?"
Technical Report: Abowd et al. (2025)
Target's Pregnancy Prediction
Source: The New York Times Magazine (2012)
Genetic Marker Recovery
Academic paper: Fredrikson et al. (2014)
Face Reconstruction
Academic paper: Fredrikson et al. (2015)
LLM Training Data Memorization
Academic paper: Carlini et al. (2021)
These examples show that privacy risks arise throughout the statistical modelling process, from data access for model fitting to the potential for reverse-engineering information from a fitted model. This project offers many directions for investigating privacy methods in statistics. Below are a few highlighted options, though motivated students are encouraged to chart their own paths through the relevant literature.
Group Project
A widely used framework for privacy analysis is differential privacy, which will be the focus of the group phase of the project. Differential privacy quantifies the so-called “membership inference risk”. The setting assumes that some summary information (such as a fitted model) is released, and that an attacker possesses “side channel” information (additional information about individuals who may be in the data). We then quantify the risk that the adversary can identify whether a given individual's data was used in model fitting, which allows us to tune how we release the model so that our privacy requirements are met.
The key requirement for differential privacy is that the information released be randomised. For instance, to release the mean salary of all participants in a trial, we would compute the mean and then add random noise before release. The theory of differential privacy tells us how much noise, and from what distribution, is needed to protect individuals’ data.
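The mean salary example can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names, the clipping bounds, the synthetic data, and the choice of the Laplace mechanism are all assumptions made here for concreteness. It also assumes that the sample size is public and that neighbouring datasets differ by replacing one record.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_mean(values, lower, upper, epsilon, rng):
    """Noisy mean via the Laplace mechanism (illustrative sketch).

    Assumptions: the sample size n is public, and neighbouring datasets
    differ by replacing one record.
    """
    # Clip each value to [lower, upper] so that the sensitivity is bounded.
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    # Replacing one clipped record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(1)
salaries = [rng.uniform(20_000, 80_000) for _ in range(500)]  # synthetic data
released = private_mean(salaries, lower=0, upper=150_000, epsilon=1.0, rng=rng)
print(f"true mean {sum(salaries) / len(salaries):.0f}, released {released:.0f}")
```

Smaller values of `epsilon` give a larger noise scale and hence stronger privacy at the cost of accuracy. Note that naive floating-point noise sampling of this kind is fine for learning purposes but is known to be unsuitable for real deployments.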
Differential privacy has its origins in the computer science community (Dwork et al., 2006) and is now a very active area of research, with a plethora of different definitions tailored to suit specific contexts. A good introductory text to the basics is Dwork and Roth (2014).
By the end of the group project you will have learned:
- The definition of differential privacy, including neighbouring datasets, the parameters \( \varepsilon \) and \( \delta \), and the interpretation of differential privacy guarantees.
- Basic mechanisms for achieving differential privacy, such as the Laplace and Gaussian mechanisms, and the exponential mechanism for general releases that balance utility and privacy.
- Sensitivity of queries and statistics, and its role in calibrating noise for privacy-preserving release.
- Privacy loss under multiple releases, including composition theorems for standard release mechanisms.
- Post-processing and invariance properties of differential privacy, and their implications for downstream analysis and data sharing.
- Methods for proving that an algorithm is differentially private, including direct verification from the definition and the use of standard composition results.
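For orientation, the central definition and a few standard results behind the points above can be stated as follows, in the formulation of Dwork and Roth (2014). The symbols \( M \), \( D \), \( D' \), \( S \), \( f \) and \( \Delta f \) are notation introduced here for illustration.

A randomised mechanism \( M \) is \( (\varepsilon, \delta) \)-differentially private if, for all neighbouring datasets \( D \) and \( D' \) and all measurable sets of outputs \( S \),

\[ \Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta . \]

The \( \ell_1 \)-sensitivity of a query \( f \) taking values in \( \mathbb{R}^k \) is

\[ \Delta f = \max_{D, D' \text{ neighbouring}} \lVert f(D) - f(D') \rVert_1 , \]

and the Laplace mechanism

\[ M(D) = f(D) + (Y_1, \dots, Y_k), \qquad Y_i \overset{\text{iid}}{\sim} \mathrm{Laplace}(0, \Delta f / \varepsilon), \]

is \( (\varepsilon, 0) \)-differentially private. Under basic composition, releasing the outputs of an \( (\varepsilon_1, \delta_1) \)-differentially private mechanism and an \( (\varepsilon_2, \delta_2) \)-differentially private mechanism together is \( (\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2) \)-differentially private. Under post-processing, if \( M \) is \( (\varepsilon, \delta) \)-differentially private, then so is \( g \circ M \) for any (possibly randomised) map \( g \) that does not itself access the data.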
By the end of the group project you will be able to:
- Write R/Python code to implement basic differentially private mechanisms for common statistical queries, such as counts, proportions, means, and histograms.
- Calibrate privacy-preserving noise using appropriate sensitivity calculations and specified privacy parameters \( \varepsilon \) and \( \delta \).
- Track and manage privacy loss across multiple analyses using basic composition principles and a privacy budget.
- Justify mathematically why a proposed procedure satisfies differential privacy by verifying the required conditions or appealing to standard privacy results.
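As a rough indication of what such code might look like, the following Python sketch combines a counting query, a histogram, and a simple privacy budget based on basic composition. The class and function names are hypothetical, the data are synthetic, and the sensitivities used assume that neighbouring datasets differ by adding or removing one record.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

class PrivacyBudget:
    """Tracks cumulative epsilon under basic (sequential) composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance guards against floating-point round-off.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

def private_count(records, predicate, epsilon, budget, rng):
    """Noisy count: adding or removing one record changes a count by at most 1."""
    budget.spend(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def private_histogram(records, bins, epsilon, budget, rng):
    """Noisy histogram over disjoint bins.

    Each record falls in at most one bin, so adding or removing one record
    changes the vector of counts by at most 1 in L1 norm.
    """
    budget.spend(epsilon)
    counts = {b: 0 for b in bins}
    for r in records:
        if r in counts:
            counts[r] += 1
    return {b: c + laplace_noise(1.0 / epsilon, rng) for b, c in counts.items()}

rng = random.Random(7)
budget = PrivacyBudget(total_epsilon=1.0)
labels = ["A", "B", "C", "D"]
grades = [rng.choice(labels) for _ in range(200)]  # synthetic data
noisy_a = private_count(grades, lambda g: g == "A", 0.25, budget, rng)
noisy_hist = private_histogram(grades, labels, 0.5, budget, rng)
print(f"spent {budget.spent} of {budget.total}")
```

Charging the budget before computing each answer means a query that would exceed the total is refused rather than answered. Tighter accounting than simple addition of the \( \varepsilon \) values is possible with more advanced composition results.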
Mode of operation and evidence of learning
The project will revolve around learning through reading, with a focus on both understanding the privacy theory and developing the practical ability to deploy that theory in concrete privacy problems. Students will demonstrate their understanding by solving example privacy problems, writing R code to extend those solutions to actual data, and clearly communicating the material in both written and oral formats.
Individual Project
The individual project will build on the knowledge acquired in the group project and will explore additional advanced topics. Possible avenues for further study include:
- Advanced differential privacy methods, such as Rényi differential privacy (Mironov, 2017) and zero-concentrated differential privacy (Bun & Steinke, 2016).
- Recent research developments such as \( f \)-differential privacy (Dong et al., 2022) or \( f \)-membership inference privacy (Leemann et al., 2023).
- Reconstruction risk approaches, which take a different view of privacy, such as Fisher information loss (Hannun et al., 2021).
- Incorporating cryptographic methods, such as homomorphic encryption (Gentry, 2010; Aslett et al., 2015a/b), homomorphic secret sharing, and multiparty computation (see, e.g., this talk).
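To give a flavour of the first of these topics, the sketch below does privacy accounting for repeated Gaussian-mechanism releases using zero-concentrated differential privacy. It relies on three results from Bun & Steinke (2016): a Gaussian mechanism with \( \ell_2 \)-sensitivity \( \Delta \) and noise standard deviation \( \sigma \) satisfies \( \rho \)-zCDP with \( \rho = \Delta^2 / (2\sigma^2) \); the \( \rho \) values add under composition; and \( \rho \)-zCDP implies \( (\rho + 2\sqrt{\rho \log(1/\delta)}, \delta) \)-differential privacy. The function names and numerical settings are illustrative assumptions.

```python
import math

def gaussian_zcdp_rho(l2_sensitivity, sigma):
    """zCDP parameter of a single Gaussian-mechanism release."""
    return l2_sensitivity ** 2 / (2.0 * sigma ** 2)

def zcdp_to_approx_dp(rho, delta):
    """Epsilon such that rho-zCDP implies (epsilon, delta)-differential privacy."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Hypothetical setting: 20 releases, each with sensitivity 1 and noise sd 10.
k, sigma, delta = 20, 10.0, 1e-6
rho_total = k * gaussian_zcdp_rho(1.0, sigma)  # rho adds up under composition
epsilon = zcdp_to_approx_dp(rho_total, delta)
print(f"total rho = {rho_total:.3f}, epsilon = {epsilon:.3f} at delta = {delta}")
```

Accounting in terms of \( \rho \) and converting once at the end typically gives a smaller final \( \varepsilon \) than adding up per-release \( (\varepsilon, \delta) \) guarantees, which is one motivation for studying these refined definitions.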
Mode of operation and evidence of learning
The project will revolve around learning through reading, with a focus on both understanding the privacy theory and developing the practical ability to deploy that theory in concrete privacy problems. Students will demonstrate their understanding by solving example privacy problems, writing R code to extend those solutions to actual data, and clearly communicating the material in both written and oral formats.
Summary
A project in privacy and confidentiality in statistics can be taken in many diverse and interesting directions; the options above are just some of them. It is also important to note that in practice we may combine these approaches, since no single method covers all aspects of a problem.
Additional Information
If you would like more information about this project, or to discuss its scope or prerequisites, don't hesitate to contact louis.aslett@durham.ac.uk or james.liley@durham.ac.uk.
Prerequisites and Co-requisites
The essential prerequisite modules are Statistical Inference II and Probability I. You will also need to be able to program in R or another language with similar capabilities.
If you are interested in the cryptography option in the individual project, then Algebra II may be helpful. Since the cryptography topics can be ignored entirely, it is not an essential module. Note in particular that this is fundamentally a statistics project, so even if you choose to look at cryptographic elements, you would not do so in the same depth as, for example, a pure mathematics project on the same topic.
References
Abowd, J.M., Adams, T., Ashmead, R., Darais, D., Dey, S., Garfinkel, S.L., Goldschlag, N., Hawes, M.B., Kifer, D., Leclerc, P., Lew, E., Moore, S., Rodríguez, R.A., Tadros, R.N. and Vilhuber, L. (2025) A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census. Working Paper CES-25-57. US Census Bureau. Available at: https://www.census.gov/library/working-papers/2025/adrm/CES-WP-25-57.html
Aslett, L.J.M., Esperança, P. and Holmes, C.C. (2015a) ‘A review of homomorphic encryption and software tools for encrypted statistical machine learning’, arXiv, 1508.06574 [stat.ML]. DOI: 10.48550/arXiv.1508.06574
Aslett, L.J.M., Esperança, P. and Holmes, C.C. (2015b) ‘Encrypted statistical machine learning: new privacy preserving methods’, arXiv, 1508.06845 [stat.ML]. DOI: 10.48550/arXiv.1508.06845
Bun, M. and Steinke, T. (2016) ‘Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds’, in M. Hirt and A. Smith (eds) Theory of Cryptography. Berlin, Heidelberg: Springer, pp. 635–658. DOI: 10.1007/978-3-662-53641-4_24.
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A. and Raffel, C. (2021) ‘Extracting Training Data from Large Language Models’, 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650. Available at: https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.
Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A. and Tramèr, F. (2022) ‘Membership Inference Attacks From First Principles’, 2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914. DOI: 10.1109/SP46214.2022.9833649
Dong, J., Roth, A. and Su, W.J. (2022) ‘Gaussian Differential Privacy’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1), pp. 3–37. DOI: 10.1111/rssb.12454.
Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) ‘Calibrating Noise to Sensitivity in Private Data Analysis’, in S. Halevi and T. Rabin (eds) Theory of Cryptography. Lecture Notes in Computer Science, 3876, pp. 265–284. DOI: 10.1007/11681878_14.
Dwork, C. and Roth, A. (2014) ‘The Algorithmic Foundations of Differential Privacy’, Foundations and Trends in Theoretical Computer Science, 9(3–4), pp. 211–407. DOI: 10.1561/0400000042.
Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D. and Ristenpart, T. (2014) ‘Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing’, 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32. Available at: https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew
Fredrikson, M., Jha, S. and Ristenpart, T. (2015) ‘Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures’, Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery (CCS ’15), pp. 1322–1333. DOI: 10.1145/2810103.2813677.
Gentry, C. (2010) ‘Computing arbitrary functions of encrypted data’, Communications of the ACM, 53(3), pp. 97–105. DOI: 10.1145/1666420.1666444.
Hannun, A., Guo, C. and van der Maaten, L. (2021) ‘Measuring data leakage in machine-learning models with Fisher information’, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, 161, pp. 760–770.
Leemann, T., Pawelczyk, M. and Kasneci, G. (2023) ‘Gaussian Membership Inference Privacy’, Advances in Neural Information Processing Systems, 36, pp. 73866–73878.
Mironov, I. (2017) ‘Rényi Differential Privacy’, 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275. DOI: 10.1109/CSF.2017.11.
Narayanan, A. and Shmatikov, V. (2007) ‘How To Break Anonymity of the Netflix Prize Dataset’. arXiv, cs/0610105 [cs.CR]. DOI: 10.48550/arXiv.cs/0610105.
Ohm, P. (2010) ‘Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization’, UCLA Law Review, 57, pp. 1701–1777. Available at: https://heinonline.org/HOL/P?h=hein.journals/uclalr57&i=1713
Sweeney, L. (2002) ‘k-anonymity: A Model For Protecting Privacy’, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), pp. 557–570. DOI: 10.1142/S0218488502001648.