EncryptedStats R Package
(16th February 2018)

Downloads:
• Mac OS X binary: EncryptedStats_0.55.tgz
• Windows binary: not available
• Source code: EncryptedStats_0.55.tar.gz

What’s new recently?
• Completely random forests.
• Semi-parametric naïve Bayes.


This package provides implementations of existing and novel statistical machine learning techniques, written in a manner amenable to encrypted computation under a homomorphic encryption scheme supporting both addition and multiplication. The implementations can be run on both encrypted and unencrypted content, making use of the full operator overloading provided by the HomomorphicEncryption R package (see that webpage; Aslett et al., 2015a; and Gentry, 2010, for an introduction to homomorphic encryption).

## Package details

This package is implemented in pure R so that installation is easy, but depends on the HomomorphicEncryption package for all cryptographic routines. The latter is implemented in high performance C/C++ code and requires some high performance numerical system libraries: please see the HomomorphicEncryption installation guide before installing this package.

### Installation

To install the Mac OS X binary, within R simply run:

 install.packages("http://www.louisaslett.com/EncryptedStats/dl/EncryptedStats_0.5.tgz", repos=NULL) 

Alternatively, to install from source, within R run:

 install.packages("http://www.louisaslett.com/EncryptedStats/dl/EncryptedStats_0.5.tar.gz", repos=NULL) 

A Windows package is not currently provided, though it should follow in the coming months.

### Package Usage

This package currently implements the two novel techniques presented in Aslett, Esperança and Holmes (2015b). For details of the techniques please see that publication. A short explanation of their use in this package follows:

### Completely random forests (CRFs)

There are three steps in working with CRFs in the package: i) grow a forest; ii) fit data to each tree (accumulate votes); and iii) predict a new observation.

When working with CRFs, the data must already be in the representation of Method 1 from Aslett, Esperança and Holmes (2015b), in either an encrypted or unencrypted matrix in R.

Imagine a toy data set consisting of $$N$$ observations of two variables (v1 and v2) where the first is ordinal and the second categorical. Assume that these are in quintiles so that the data matrix is $$N \times 10$$, stored in the variable called X, say, in R. Also assume the responses of the training observations are similarly stored in y.
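The exact Method 1 representation is defined in the paper; purely as an illustrative sketch, such an indicator matrix might be built in plain R as follows (the bin counts and per-bin indicator encoding here are assumptions for illustration, not the paper's definition):

```r
# Illustrative sketch only: assume each variable is represented by one
# 0/1 indicator column per quantile bin (v1, ordinal) or level (v2,
# categorical); see Aslett, Esperança and Holmes (2015b) for details.
set.seed(1)
N <- 20
v1 <- runif(N)                          # ordinal variable
v2 <- sample(letters[1:5], N, TRUE)     # categorical variable, 5 levels

# Bin v1 into quintiles, then expand both variables to 0/1 indicators
v1.bin <- cut(v1, quantile(v1, 0:5/5), labels=FALSE, include.lowest=TRUE)
X <- cbind(outer(v1.bin, 1:5, `==`) + 0L,
           outer(v2, letters[1:5], `==`) + 0L)
dim(X)   # N x 10: five columns for v1, five for v2
```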

i) Forest growth

Tree growth is blind to the data, so you only need to specify the variable name for each column of the (un)encrypted data matrix and the type of each variable – either ordinal ("ord") or categorical ("cat"). For the toy example,

> Xvars <- rep(c("v1", "v2"), each=5)
> Xvars
[1] "v1" "v1" "v1" "v1" "v1" "v2" "v2" "v2" "v2" "v2"


This vector of names is used to tell the tree growth algorithm that the first 5 columns of X contain a quantisation of variable v1 and the second 5 columns contain a quantisation of variable v2.

> Xtype <- c("v1"="ord", "v2"="cat")
> Xtype
v1    v2
"ord" "cat"


This vector then specifies the type for each variable. Together these can be passed to the forest growth algorithm, along with the specification of the number of trees to grow (T) and how deep to grow the trees (L):

> forest <- CRF.grow(Xvars, Xtype, T=200, L=2)


ii) Fit the data

The fitting step is now straightforward. The forest is given to the fitting function, together with the data matrix and the responses, either encrypted or unencrypted. The only additional option is the size of the stochastic fraction estimate to use (the $$M$$ parameter from the paper), passed as argument resamp.

To do this unencrypted, one simply runs:

> fit <- CRF.fit(forest, X, y, resamp=8)


If X and y are encrypted, then the algorithm requires the value 1 encrypted, as well as $$M$$ both encrypted and in plain text, since the stochastic fraction estimate is $$M-\sum_{i=1}^M \eta_i + 1$$, where the $$\eta_i$$ come from X. Therefore, the public key must be provided so that the algorithm can produce encrypted versions of these,
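To see why only these encrypted constants are needed, note that the estimate involves nothing but additions and subtractions, which the encryption scheme supports directly. A toy unencrypted sketch (the $$\eta_i$$ values here are made up):

```r
# Toy sketch of the stochastic fraction estimate M - sum(eta) + 1,
# using made-up 0/1 indicator values eta_i; only addition and
# subtraction are needed, so the same expression works on ciphertexts.
M <- 8
eta <- c(1, 0, 1, 1, 0, 1, 0, 1)
M - sum(eta) + 1   # the estimate: here 8 - 5 + 1 = 4
```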

> fit <- CRF.fit(forest, X, y, resamp=8, pk=keys$pk)


iii) Prediction

Imagine now that there are $$N'$$ test observations for the toy example, similarly stored in an unencrypted or encrypted $$N' \times P$$ matrix, newX. Then prediction proceeds using the forest and fit from the previous commands:

> pred <- CRF.pred(forest, fit, newX)


For similar reasons to above, the public key is required when computing with encrypted data,

> pred <- CRF.pred(forest, fit, newX, pk=keys$pk)


Upon completion, pred will be an $$N' \times 2$$ matrix with the votes for class $$j$$ on observation $$i$$ in the $$(i,j)$$th entry.
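Once decrypted, a hard classification can be recovered from the vote matrix by taking the column with the most votes in each row; a small made-up sketch:

```r
# Sketch: hypothetical decrypted vote matrix for 2 observations, 2 classes
pred <- matrix(c(120,  80,
                  40, 160), nrow=2, byrow=TRUE)
apply(pred, 1, which.max)   # predicted class per observation: 1 2
```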

### Semi-parametric naïve Bayes (SNB)

There are two steps in working with the SNB classifier in the package: i) fit a model to the data, i.e., estimate the factors that make up the parameters $$\{ \hat{\theta}, \hat{\alpha}, \hat{\beta} \}$$; ii) predict a new observation.

The user again has the choice of working with encrypted or unencrypted data (this is detected automatically), with the proviso that both steps must use the same type of data, i.e., either both use encrypted data or both use unencrypted data.

Assume we have an $$N \times P$$ design matrix X and a length $$N$$ binary response vector y; and let cX and cy denote the corresponding ciphertexts, encrypted with the package HomomorphicEncryption.
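For concreteness, ciphertexts such as cX and cy are produced with the HomomorphicEncryption package; the sketch below assumes the pars/keygen/enc interface described in that package's own documentation (not run here):

```r
# Sketch, assuming the pars/keygen/enc interface documented for the
# HomomorphicEncryption package.
library(HomomorphicEncryption)
p    <- pars("FandV")      # Fan and Vercauteren scheme parameters
keys <- keygen(p)          # keys$pk is public, keys$sk is secret
cX   <- enc(keys$pk, X)    # encrypt the design matrix
cy   <- enc(keys$pk, y)    # encrypt the response vector
```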

i) Estimation

Besides the data there are only two parameters in the estimation step: paired controls whether paired or unpaired optimisation is to be used; and pk is the public key used to encrypt cX and cy – it is used to encrypt some constants required in the estimation process and is only needed if encrypted data is used.

Estimating the class counts $$\{ \text{counts}(y=0), \text{counts}(y=1) \}$$ and the coefficients $$\{ a_{j}, b_{j}, d_{j} \}$$ is then straightforward using the fitting function SNB.fit. Say $$N=10$$ and $$P=2$$; then, with unencrypted data we obtain

> snb1.fit <- SNB.fit(X, y, paired=T)
> snb1.fit
$counts.y
[1] 4 6

$coeffs
  [,1] [,2]
a  110  462
b   -8 -120
d  221  105


while with encrypted data we obtain

> snb2.fit <- SNB.fit(cX, cy, paired=T, pk=keys$pk)
> snb2.fit
$counts.y
Vector of 2 Fan and Vercauteren cipher texts

$coeffs
Matrix of 3 x 2 Fan and Vercauteren cipher texts


ii) Prediction

The function SNB.fit generates an object of class SNB, which can be fed directly into the base R predict function. Apart from the fitted model and the testing data newX, there are only two parameters: type controls the type of prediction to be returned (raw coefficients or conditional class probabilities); sk is the secret key, required if predictions are to be decrypted. Continuing the example above, we can obtain the probabilities of class $$y=1$$,

> snb2.predProb <- predict(model=snb2.fit, newX=cX, type="prob", sk=keys$sk)
> snb2.predProb
[1] 0.2212818 0.1973391 0.7221404 0.8975560 0.7293452 0.4353264 0.4621838
[8] 0.4442447 0.7364313 0.8907048


or obtain the class counts and coefficients $$\{ e_{ij}, d_{j} \}$$ for all testing observations,

> snb2.predRaw <- predict(model=snb2.fit, newX=cX, type="raw", sk=keys$sk)
> snb2.predRaw
$counts.y
[1] 4 6

$coeffs.e
      [,1] [,2]
 [1,]  102 -138
 [2,]   70 -138
 [3,]   86  102
 [4,]  102  222
 [5,]   94  102
 [6,]   70  -18
 [7,]   94  -18
 [8,]   78  -18
 [9,]  102  102
[10,]   86  222

$coeffs.d
[1] 221 105
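With the probability predictions decrypted, hard class labels can be obtained by thresholding at 0.5; a small sketch using the first few values from the probability output above:

```r
# Sketch: threshold decrypted class-1 probabilities at 0.5
probs <- c(0.2212818, 0.1973391, 0.7221404)   # first entries from above
as.integer(probs > 0.5)   # predicted y: 0 0 1
```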


## Citation information

If you use this package, please use the following citation:

Aslett, L. J. M., Esperança, P. M. and Holmes, C. C. (2015), Encrypted statistical machine learning: new privacy preserving methods. Technical report, University of Oxford.

@techreport{Aslett2015,
Author = {L. J. M. Aslett and P. M. Esperan{\c c}a and C. C. Holmes},
Institution = {University of Oxford},
Title = {Encrypted statistical machine learning: new privacy preserving methods},
Year = {2015}}


## References

Aslett, L. J. M., Esperança, P. M. and Holmes, C. C. (2015a), ‘A review of homomorphic encryption and software tools for encrypted statistical machine learning’, arXiv.

Aslett, L. J. M., Esperança, P. M. and Holmes, C. C. (2015b), ‘Encrypted statistical machine learning: new privacy preserving methods’, arXiv.

Gentry, C. (2010), ‘Computing arbitrary functions of encrypted data’, Communications of the ACM 53(3), 97–105.

R Core Team (2015), R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/