(16th February 2018)
Format | Download
Mac OS X binary | EncryptedStats_0.55.tgz
Windows binary | Not available
Source code | EncryptedStats_0.55.tar.gz
What’s new recently?
- Completely random forests.
- Semi-parametric naïve Bayes.
This package provides implementations of existing and novel statistical machine learning techniques, written so as to be amenable to encrypted computation under a homomorphic encryption scheme supporting both addition and multiplication. The implementations can be run on both encrypted and unencrypted content, making use of the full operator overloading supported by the HomomorphicEncryption R package (see that webpage; Aslett et al., 2015a; and Gentry, 2010, for an introduction to homomorphic encryption).
Package details
This package is implemented in pure R so that installation is easy, but it depends on the HomomorphicEncryption package for all cryptographic routines. The latter is implemented in high-performance C/C++ code and requires some high-performance numerical system libraries: please see the HomomorphicEncryption installation guide before installing this package.
Installation
install.packages("http://www.louisaslett.com/EncryptedStats/dl/EncryptedStats_0.55.tgz", repos=NULL)
install.packages("http://www.louisaslett.com/EncryptedStats/dl/EncryptedStats_0.55.tar.gz", repos=NULL)
Package Usage
This package currently implements the two novel techniques presented in Aslett, Esperança and Holmes (2015b). For details of the techniques please see that publication. A short explanation of their use in this package follows:
Completely random forests (CRFs)
There are three steps in working with CRFs in the package: i) grow a forest; ii) fit data to each tree (accumulate votes); and iii) predict a new observation.
When working with CRFs, the data must already be in the representation of Method 1 from Aslett, Esperança and Holmes (2015b), in either an encrypted or unencrypted matrix in R.
Imagine a toy data set consisting of \(N\) observations of two variables (v1 and v2) where the first is ordinal and the second categorical. Assume that these are in quintiles so that the data matrix is \(N \times 10\), stored in the variable called X, say, in R. Also assume the responses of the training observations are similarly stored in y.
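A sketch of how such a representation might be constructed in plain R (the variable names and quantisation choices here are purely illustrative, not part of the package API): each variable is expanded into 5 binary indicator columns recording quintile or category membership, giving an \(N \times 10\) matrix of 0/1 entries.

```r
# Illustrative construction of the Method 1 indicator representation.
set.seed(1)
N  <- 20
v1 <- rnorm(N)                               # ordinal (continuous) variable
v2 <- sample(letters[1:5], N, replace=TRUE)  # categorical variable, 5 levels

# Quintile membership indicators for v1
q  <- cut(v1, breaks=quantile(v1, probs=seq(0, 1, 0.2)), include.lowest=TRUE)
X1 <- outer(q, levels(q), `==`) * 1L

# Category membership indicators for v2
X2 <- outer(v2, letters[1:5], `==`) * 1L

X <- cbind(X1, X2)
dim(X)      # 20 10
rowSums(X)  # all 2: exactly one indicator per variable in each row
```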
i) Forest growth
Tree growth is blind to the data, so you only need to identify the variable names of each column of the (un)encrypted data matrix and their type – either ordinal ("ord") or categorical ("cat"). For the toy example,
> Xvars <- rep(c("v1", "v2"), each=5)
> Xvars
[1] "v1" "v1" "v1" "v1" "v1" "v2" "v2" "v2" "v2" "v2"
This vector of names is used to tell the tree growth algorithm that the first 5 columns of X contain a quantisation of variable v1 and the second 5 columns contain a quantisation of variable v2.
> Xtype <- c("v1"="ord", "v2"="cat")
> Xtype
v1 v2
"ord" "cat"
This vector then specifies the type for each variable. Together these can be passed to the forest growth algorithm, along with the specification of the number of trees to grow (T) and how deep to grow the trees (L):
> forest <- CRF.grow(Xvars, Xtype, T=200, L=2)
ii) Fit the data
The fitting step is now straightforward. The forest is given to the fitting function, together with the data matrix and the responses, either encrypted or unencrypted. The only additional option is the size of stochastic fraction estimate to use (the \(M\) parameter from the paper), passed as the argument resamp.
To do this unencrypted, one simply runs:
> fit <- CRF.fit(forest, X, y, resamp=8)
If X and y are encrypted, then the algorithm requires the value 1 encrypted, as well as \(M\) both encrypted and in plain text, since the stochastic fraction estimate is \(M-\sum_{i=1}^M \eta_i + 1\), where the \(\eta_i\) come from X. Therefore, the public key must be provided so that the algorithm can produce encrypted versions of these,
> fit <- CRF.fit(forest, X, y, resamp=8, pk=keys$pk)
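The keys object here is assumed to come from the HomomorphicEncryption package; a minimal sketch of generating a key pair and encrypting the data, assuming that package's documented Fan–Vercauteren interface (pars, keygen, enc):

```r
# Sketch only: assumes the HomomorphicEncryption package is installed
# and that its pars()/keygen()/enc() interface is as documented.
library(HomomorphicEncryption)

p    <- pars("FandV")   # parameters for the Fan-Vercauteren scheme
keys <- keygen(p)       # keys$pk is the public key, keys$sk the secret key

cX <- enc(keys$pk, X)   # encrypt the (already quantised) data matrix
cy <- enc(keys$pk, y)   # encrypt the responses

fit <- CRF.fit(forest, cX, cy, resamp=8, pk=keys$pk)
```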
iii) Prediction
Imagine now that there are \(N'\) test observations for the toy example, similarly stored in an unencrypted or encrypted \(N' \times P\) matrix, newX. Then prediction proceeds using the forest and fit from the previous commands:
> pred <- CRF.pred(forest, fit, newX)
For similar reasons to above, the public key is required when computing with encrypted data,
> pred <- CRF.pred(forest, fit, newX, pk=keys$pk)
Upon completion, pred will be an \(N' \times 2\) matrix with the votes for class \(j\) on observation \(i\) in the \((i,j)\)th entry.
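Once decrypted, the vote matrix can be turned into class predictions in plain R; a minimal sketch (the vote counts below are made up for illustration):

```r
# Hypothetical decrypted vote matrix for N' = 3 test observations and
# 2 classes: the predicted class is the column with the most votes.
pred <- matrix(c(120,  80,
                  45, 155,
                 100, 100), ncol=2, byrow=TRUE)
class.pred <- max.col(pred, ties.method="first")
class.pred  # 1 2 1
```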
Semi-parametric naïve Bayes (SNB)
There are two steps in working with the SNB classifier in the package: i) fit a model to the data, i.e., estimate the factors that make up the parameters \(\{ \hat{\theta}, \hat{\alpha}, \hat{\beta} \}\); ii) predict a new observation.
The user again has the choice of working with encrypted or unencrypted data (this is automatically detected), with the provision that both steps must use the same type of data, i.e., either both use encrypted data or both use unencrypted data.
Assume we have an \(N \times P\) design matrix X and a length-\(N\) binary response vector y; and let cX and cy denote the corresponding ciphertexts, encrypted with the package HomomorphicEncryption.
i) Estimation
Besides the data there are only two parameters in the estimation step: paired controls whether paired or unpaired optimisation is to be used; and pk is the public key used to encrypt cX and cy – it is used to encrypt some constants required in the estimation process and is only needed if encrypted data are used.
Estimating the class counts \(\{ \text{counts}(y=0), \text{counts}(y=1) \}\) and the coefficients \(\{ a_{j}, b_{j}, d_{j} \}\) is then straightforward using the fitting function SNB.fit. Say \(N=10\) and \(P=2\); then, with unencrypted data we obtain
> snb1.fit <- SNB.fit(X, y, paired=T)
> snb1.fit
$counts.y
[1] 4 6
$coeffs
[,1] [,2]
a 110 462
b -8 -120
d 221 105
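For unencrypted data the $counts.y component is simply the per-class count of the binary response; e.g., with a toy response consistent with the output above (the particular y here is made up for illustration):

```r
# Toy binary response with 4 zeros and 6 ones, matching $counts.y above.
y <- c(0, 1, 1, 0, 1, 0, 1, 1, 0, 1)
c("y=0"=sum(y == 0), "y=1"=sum(y == 1))
# y=0 y=1
#   4   6
```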
while with encrypted data we obtain
> snb2.fit <- SNB.fit(cX, cy, paired=T, pk=keys$pk)
> snb2.fit
$counts.y
Vector of 2 Fan and Vercauteren cipher texts
$coeffs
Matrix of 3 x 2 Fan and Vercauteren cipher texts
ii) Prediction
The function SNB.fit generates an object of class SNB, which can be fed directly into the base R predict function.
Apart from the fitted model and the testing data newX, there are only two parameters:
- type controls the type of prediction to be returned (raw coefficients or conditional class probabilities);
- sk is the secret key – required if predictions are to be decrypted.
Continuing the example above, we can obtain the probabilities of class \(y=1\),
> snb2.predProb <- predict(model=snb2.fit, newX=cX, type="prob", sk=keys$sk)
> snb2.predProb
[1] 0.2212818 0.1973391 0.7221404 0.8975560 0.7293452 0.4353264 0.4621838
[8] 0.4442447 0.7364313 0.8907048
or obtain the class counts and coefficients \(\{ e_{ij}, d_{j} \}\) for all testing observations,
> snb2.predRaw <- predict(model=snb2.fit, newX=cX, type="raw", sk=keys$sk)
> snb2.predRaw
$counts.y
[1] 4 6
$coeffs.e
[,1] [,2]
[1,] 102 -138
[2,] 70 -138
[3,] 86 102
[4,] 102 222
[5,] 94 102
[6,] 70 -18
[7,] 94 -18
[8,] 78 -18
[9,] 102 102
[10,] 86 222
$coeffs.d
[1] 221 105
Citation information
If you use this package, please use the following citation:
Aslett, L. J. M., Esperança, P. M. and Holmes, C. C. (2015), Encrypted statistical machine learning: new privacy preserving methods. Technical report, University of Oxford.
@techreport{Aslett2015,
Author = {L. J. M. Aslett and P. M. Esperan{\c c}a and C. C. Holmes},
Institution = {University of Oxford},
Title = {Encrypted statistical machine learning: new privacy preserving methods},
Year = {2015}}
References
Aslett, L. J. M., Esperança, P. M. & Holmes, C. C. (2015a), 'A review of homomorphic encryption and software tools for encrypted statistical machine learning', arXiv.
Aslett, L. J. M., Esperança, P. M. & Holmes, C. C. (2015b), 'Encrypted statistical machine learning: new privacy preserving methods', arXiv.
Gentry, C. (2010), ‘Computing arbitrary functions of encrypted data’, Communications of the ACM 53(3), 97–105.
R Core Team (2015), R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/