Privacy

Supervisor: Prof. L.J.M. Aslett

The collection of data is a pervasive aspect of contemporary life, ranging from personal devices to industrial processes. This data proliferation offers numerous advantages, such as enhanced predictive accuracy and improved decision-making capabilities, but it also introduces substantial privacy issues. Across domains like finance, education, and healthcare, sensitive personal information is collected, stored, and analysed, meaning that the potential for misuse or unauthorised disclosure of this data can have extensive repercussions for individuals.

A readily relatable example where these concerns are particularly pertinent is medical statistics. Researchers collect and analyse patient data to advance diagnosis, treatment, and disease prevention. However, handling sensitive information like medical histories and genetic profiles introduces risks of unauthorised access or disclosure, which can impact not only individual patients but also trust in the broader healthcare system. This creates a tension between the necessity for individual privacy and the collective benefits derived from widely accessible medical data. Risks exist throughout the statistical modelling process, from data access for model fitting to the potential of reverse-engineering information from a fitted model.

Traditionally, privacy was preserved through methods like anonymisation or data aggregation. However, the vast quantity of data available for side-channel attacks and the need for detailed model fitting in modern contexts have driven extensive research into more sophisticated techniques.

This project offers many directions for investigating privacy methods in statistics. Below are a few highlighted options, though motivated students are encouraged to chart their own paths through the relevant literature.

Fitted model risk: reverse engineering

It might be surprising, but revealing a fitted model is not always safe! An adversary examining the model, possibly with auxiliary information at their disposal, can often reverse-engineer details about the original data used for model fitting.

There are at least two problems: (i) at a high level, it may be problematic if we can identify who was present in the data used for the model (for example, for certain disease models mere presence in the data set is informative); and (ii) at a granular level, it may be possible to reconstruct the data (or bounds on the values) given a fitted model.

Option I: differential privacy

So-called "membership inference risk" can be quantified using a stable of techniques called differential privacy. These methods assume some summary information (like a model) is released, and that an attacker possesses "side channel" information (additional information about individuals who may be in the data). We then quantify the risk that an adversary can identify whether an individual was used in model fitting, allowing us to tune how we release the model to ensure our privacy requirements are met.

The key requirement for differential privacy is that the information released be randomised. For instance, to release the mean salary of all participants in a trial, we would compute the mean and then add noise before release. Differential privacy then guides how much noise, and of what kind, is needed to protect individuals' data.
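
As a minimal sketch of this idea in R (the salary bounds, sample size and privacy budget \( \varepsilon \) below are purely illustrative), the classical Laplace mechanism clamps each value to a known range so that the sensitivity of the mean is bounded, and then adds Laplace noise scaled to that sensitivity:

```r
# Minimal sketch of the Laplace mechanism for releasing a differentially
# private mean. All bounds and the privacy budget epsilon are illustrative.

rlaplace <- function(n, scale) {
  # Laplace(0, scale) noise as the difference of two exponentials
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

private_mean <- function(x, lower, upper, epsilon) {
  x <- pmin(pmax(x, lower), upper)            # clamp so the sensitivity is known
  sensitivity <- (upper - lower) / length(x)  # changing one record moves the mean by at most this
  mean(x) + rlaplace(1, scale = sensitivity / epsilon)
}

set.seed(1)
salaries <- rgamma(100, shape = 2, scale = 15000)           # synthetic "trial" salaries
private_mean(salaries, lower = 0, upper = 1e5, epsilon = 1)
```

Smaller values of \( \varepsilon \) give stronger privacy guarantees but a noisier release, and the choice of clamping range is itself a modelling decision, trading a little bias against the amount of noise required.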

Differential privacy has its origins in the computer science community (Dwork et al., 2006) and is now a very active area of research, with a plethora of different definitions tailored to suit specific contexts. A good introductory text to the basics is Dwork and Roth (2014). There are very recent generalisations, including \(f\)-differential privacy, which set this on a firmer statistical footing (Dong et al., 2022).

Option II: reconstruction risk

There are many situations where simply knowing that someone was used in fitting a model does not represent a major risk. Instead, there may be more concern about the possibility of (partially) reconstructing an individual's data by reverse-engineering the released model.

Like differential privacy, this approach depends on injecting noise into the released model. However, it differs by directly seeking to quantify the mean square error an adversary may make in trying to reconstruct actual data values from the fitted model, again with the possible use of extra side-channel information.
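
To make this concrete, the following toy R simulation (all numbers are purely illustrative) releases a noise-perturbed mean of \( n \) records, while the adversary's side-channel information consists of every record except one; the adversary simply solves for the missing record and we estimate their mean square error:

```r
# Toy illustration of reconstruction risk: the data holder releases a
# noise-perturbed mean, and an adversary who already knows all but one
# record (side-channel information) solves for the remaining one.

set.seed(1)
n        <- 50
noise_sd <- 2        # standard deviation of the noise added before release
reps     <- 10000

sq_err <- replicate(reps, {
  x        <- rnorm(n, mean = 10, sd = 3)          # the sensitive data
  released <- mean(x) + rnorm(1, sd = noise_sd)    # the perturbed release
  # Adversary knows x[1], ..., x[n-1] and inverts the mean for x[n]:
  reconstruction <- n * released - sum(x[-n])
  (reconstruction - x[n])^2
})

mean(sq_err)          # empirical mean square error of the reconstruction ...
n^2 * noise_sd^2      # ... which matches the theoretical value
```

With no noise the missing record is recovered exactly, while noise of standard deviation \( \sigma \) makes the reconstruction error in this simple setting have mean square \( n^2 \sigma^2 \), so the amount of injected noise directly controls the reconstruction risk, at the cost of a less accurate release.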

This is a newer area of research than differential privacy, with recent attempts such as Hannun et al. (2021), which has led to multiple follow-up works.

Model fitting risk: conducting statistics blindfolded

Access to sensitive data sets is closely overseen by the holding organisation, for example NHS Trusts in the case of many medical data sets. There is usually a long and involved contractual process to cover all data governance concerns prior to access to data being granted. Even once legal agreements are in place and access is allowed, researchers are often only allowed to work with the data in tightly controlled environments, referred to as "safe havens".

At least two problems persist: (i) slow research progress due to legal processes; and (ii) limited access to one data set at a time.

Option III: Cryptography

In terms of (i), these contractual processes can take a very long time: it would be helpful if we could model data without "seeing" it. It turns out that we can indeed perform some statistical analyses while effectively "blindfolded" from the data. This can be achieved by using modern homomorphic encryption methods.

Cryptographic methods have been used since the dawn of civilisation in order to keep information secret. Fundamentally, an encryption algorithm is a mathematical transformation of your data in such a way as to make it difficult to recover the original information without knowledge of some secret, commonly called a private key, \(k_s\).

Homomorphic encryption schemes are a remarkable subset of encryption schemes which additionally support the homomorphic property. Let all messages belong to some message space \( M \) and let encryption under a public key \(k_p\) produce ciphertexts in a corresponding ciphertext space \( C \). Then a scheme is homomorphic if there is a set of operations \( \circ \in \mathcal{F}_M \) acting in message space (such as addition) that have corresponding operations \( \diamond \in \mathcal{F}_C \) acting in ciphertext space satisfying the property: $$ \texttt{Dec}(k_s, \texttt{Enc}(k_p, m_1) \diamond \texttt{Enc}(k_p, m_2)) = m_1 \circ m_2 $$ For example, the simple scheme in Gentry (2010) describes a method where \( \mathcal{F}_M = \{+, \times\} \) and \( \mathcal{F}_C = \{+, \times\} \), though there is no restriction that the operations must correspond in all schemes. Indeed, Paillier encryption (Paillier, 1999) is homomorphic only for addition, with \( \mathcal{F}_M = \{+\} \) but \( \mathcal{F}_C = \{\times\} \). It is rarely mentioned, but interestingly even the famous RSA cryptosystem (Rivest et al., 1978), which you might have studied in third year, is homomorphic, although only for multiplication, with \( \mathcal{F}_M = \{\times\} \) and \( \mathcal{F}_C = \{\times\} \) (note that this is only true for the simplest original form of RSA, not more recent variants).
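
To see how addition of messages can correspond to multiplication of ciphertexts, consider Paillier encryption where, suppressing the key generation details, \( \texttt{Enc}(k_p, m) = g^m r^n \bmod n^2 \) for a public modulus \( n \), public base \( g \) and fresh random \( r \). Multiplying two ciphertexts gives $$ \texttt{Enc}(k_p, m_1) \times \texttt{Enc}(k_p, m_2) = g^{m_1} r_1^n \cdot g^{m_2} r_2^n = g^{m_1 + m_2} (r_1 r_2)^n \pmod{n^2}, $$ which is itself a valid encryption of \( m_1 + m_2 \bmod n \), so decrypting the product of the two ciphertexts returns the sum of the two messages.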

For a more detailed introduction, see the review of homomorphic encryption for statisticians in Aslett et al. (2015a), or the excellent and entertaining general introduction to homomorphic encryption concepts in Gentry (2010). Note that this is, at heart, a statistics project and so we would not go into the depth that a pure mathematics project would on these topics.

However, homomorphic encryption is not quite a panacea: there are precious few statistical models that can be fitted directly to homomorphically encrypted data, because the operations that are allowed are so simple. Hence, it is necessary to adapt existing methodology or create entirely new approaches. For an example of this, see Aslett et al. (2015b).

Option IV: Multiparty computation

In respect of problem (ii), as we know from the statistics you have studied throughout your degree, many estimators have accuracy which improves at a rate \( \mathcal{O}(n^{-\frac{1}{2}}) \), so that more and more data is required to achieve improved accuracy. Returning to the medical example, this problem is particularly acute in the case of rare diseases, where a single NHS Trust may simply have too few cases to be able to fit interesting or practically useful models. Here, we want to achieve the "blindfolding" mentioned above, but in a way that lets us combine many data sets at the same time.

Fortunately, there is another suite of methods which are related to cryptography, but subtly different. These include multiparty computation methods like homomorphic secret sharing, where one can securely perform computations on data held by different organisations with only the final result being visible to everyone.
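
To give a flavour of the idea, the toy R sketch below illustrates additive secret sharing, one building block of such schemes (the prime \( p \), the three data holders and the three servers are purely illustrative): each value is split into shares which individually look uniformly random and so reveal nothing, yet the servers can sum the shares they hold and only the overall total is ever reconstructed.

```r
# Toy additive secret sharing modulo a prime p: each data holder splits its
# value into shares that individually look uniformly random, sending one
# share to each server; summing within and then across servers reveals only
# the overall total, never the individual values. Illustrative only.

p <- 2147483647   # a large prime (2^31 - 1); all arithmetic is modulo p

share <- function(value, n_shares = 3) {
  r <- sample.int(p, n_shares - 1, replace = TRUE) - 1   # uniform on 0, ..., p - 1
  c(r, (value - sum(r)) %% p)                            # shares sum to value modulo p
}

set.seed(1)
values <- c(12, 7, 30)            # one sensitive value per data holder
shares <- sapply(values, share)   # column j holds the shares of values[j]

partial <- rowSums(shares) %% p   # each "server" (row) sums the shares it holds
sum(partial) %% p                 # combining the partial sums gives sum(values) = 49
```

Homomorphic secret sharing schemes build on this kind of construction so that more interesting functions than a simple sum can be evaluated on the shares, which is what makes fitting statistical models across several organisations' data possible.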

These methods are less common in statistics and are an active area of my own research. For more background, you can see a talk I gave a few years ago on using homomorphic secret sharing for Bayesian inference. As this is a newer area, there are various genuine research project opportunities here.

Summary

There are many diverse and interesting directions in which a project on privacy and confidentiality in statistics can be taken, the above four options being just some of them. It is also important to note that in practice, we may combine these approaches since no single method covers all aspects of a problem.

Modules

Essential prior modules are Statistical Inference II and Probability I, as well as the ability to program in R or another language with similar capabilities.

Having studied any of the statistics modules involving different modelling approaches (e.g. Statistical Modelling II, Advanced Statistical Modelling III, Bayesian Computation and Modelling III, Machine Learning and Neural Networks III, etc.) will be beneficial since privacy could be examined in any of these settings. Since any of those modules are relevant, no particular one is deemed an essential prior module.

If you are interested in the cryptography route through this project, then any of Cryptography & Codes III, Elementary Number Theory II, or Algebra II may be helpful. Since cryptography is not the only route through this project these are not essential prior modules. In particular, note that this is fundamentally a statistics project, so we would not study cryptographic elements to the same depth as, for example, a pure mathematics project on the same topic.

References

Aslett, L.J.M., Esperança, P. and Holmes, C.C. (2015a) ‘A review of homomorphic encryption and software tools for encrypted statistical machine learning’, arXiv, 1508.06574 [stat.ML]. DOI: 10.48550/arXiv.1508.06574

Aslett, L.J.M., Esperança, P. and Holmes, C.C. (2015b) ‘Encrypted statistical machine learning: new privacy preserving methods’, arXiv, 1508.06845 [stat.ML]. DOI: 10.48550/arXiv.1508.06845

Gentry, C. (2010) ‘Computing arbitrary functions of encrypted data’, Communications of the ACM, 53(3), pp. 97–105. DOI: 10.1145/1666420.1666444

Dong, J., Roth, A. and Su, W.J. (2022) ‘Gaussian Differential Privacy’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1), pp. 3–37. DOI: 10.1111/rssb.12454.

Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) ‘Calibrating Noise to Sensitivity in Private Data Analysis’. In: Halevi, S., Rabin, T. (eds) Theory of Cryptography. Lecture Notes in Computer Science, 3876, pp. 265-284. DOI: 10.1007/11681878_14

Dwork, C. and Roth, A. (2014) ‘The Algorithmic Foundations of Differential Privacy’, Foundations and Trends in Theoretical Computer Science, pp. 211-407. DOI: 10.1561/0400000042

Hannun, A., Guo, C. and van der Maaten, L. (2021) ‘Measuring data leakage in machine-learning models with Fisher information’, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, 161, pp. 760-770.

Paillier, P. (1999) ‘Public-key cryptosystems based on composite degree residuosity classes’. In: Advances in Cryptology - EUROCRYPT'99, pp. 223-238. DOI: 10.1007/3-540-48910-X_16

Rivest, R.L., Shamir, A. and Adleman, L. (1978) ‘A method for obtaining digital signatures and public-key cryptosystems’, Communications of the ACM, 21(2), pp. 120-126. DOI: 10.1145/359340.359342.