__EncryptedStats R Package__

(16th February 2018)

| Format | Download |
| --- | --- |
| Mac OS X binary | EncryptedStats_0.55.tgz |
| Windows binary | Not available |
| Source code | EncryptedStats_0.55.tar.gz |

**What’s new recently?**

- Completely random forests.
- Semi-parametric naïve Bayes.

This package provides implementations of existing and novel statistical machine learning techniques in a manner amenable to encrypted computation under a homomorphic encryption scheme supporting both addition and multiplication. The implementations are such that they can be run on both encrypted and unencrypted content, making use of the full operator overloading supported by the `HomomorphicEncryption` R package (see that webpage; Aslett et al., 2015a; and Gentry, 2010, for an introduction to homomorphic encryption).

## Package details

This package is implemented in pure R so that installation is easy, but depends on the `HomomorphicEncryption` package for all cryptographic routines. The latter is implemented in high performance C/C++ code and requires some high performance numerical system libraries: please see the `HomomorphicEncryption` installation guide before installing this package.

### Installation

Mac OS X binary:

```
install.packages("http://www.louisaslett.com/EncryptedStats/dl/EncryptedStats_0.55.tgz", repos=NULL)
```

Source code:

```
install.packages("http://www.louisaslett.com/EncryptedStats/dl/EncryptedStats_0.55.tar.gz", repos=NULL)
```

### Package Usage

This package currently implements the two novel techniques presented in Aslett, Esperança and Holmes (2015b). For details of the techniques please see that publication. A short explanation of their use in this package follows:

### Completely random forests (CRFs)

There are three steps in working with CRFs in the package: i) grow a forest; ii) fit data to each tree (accumulate votes); and iii) predict a new observation.

When working with CRFs, the data must already be in the representation of Method 1 from Aslett, Esperança and Holmes (2015b), in either an encrypted or unencrypted matrix in R.

Imagine a toy data set consisting of \(N\) observations of two variables (`v1` and `v2`) where the first is ordinal and the second categorical. Assume that these are in quintiles so that the data matrix is \(N \times 10\), stored in the variable called `X`, say, in R. Also assume the responses of the training observations are similarly stored in `y`.
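As a plaintext illustration of this Method 1 representation, the toy matrix `X` can be built as follows. The `indicator` helper below is not part of the package; it is just a sketch of how one 0/1 column per quantile bin yields the \(N \times 10\) matrix described above.

```r
# Toy data: N = 10 observations; v1 and v2 each quantised to 5 bins,
# recorded here as the bin index (1-5) each observation falls in.
set.seed(1)
v1 <- sample(1:5, 10, replace = TRUE)
v2 <- sample(1:5, 10, replace = TRUE)

# Method 1 representation: one 0/1 indicator column per bin, giving an
# N x 10 matrix for two variables with 5 bins each.
indicator <- function(z, nbins) {
  m <- matrix(0L, nrow = length(z), ncol = nbins)
  m[cbind(seq_along(z), z)] <- 1L
  m
}
X <- cbind(indicator(v1, 5), indicator(v2, 5))
dim(X)   # 10 x 10; each row has exactly one 1 per variable
```

Each row then contains exactly two ones: one marking the bin of `v1` and one marking the bin of `v2`.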

**i) Forest growth**

Tree growth is blind to the data, so you only need to identify the variable names of each column of the (un)encrypted data matrix and their type – either ordinal (`"ord"`) or categorical (`"cat"`). For the toy example,

```
> Xvars <- rep(c("v1", "v2"), each=5)
> Xvars
[1] "v1" "v1" "v1" "v1" "v1" "v2" "v2" "v2" "v2" "v2"
```

This vector of names is used to tell the tree growth algorithm that the first 5 columns of `X` contain a quantisation of variable `v1` and the second 5 columns contain a quantisation of variable `v2`.

```
> Xtype <- c("v1"="ord", "v2"="cat")
> Xtype
v1 v2
"ord" "cat"
```

This vector then specifies the type for each variable. Together these can be passed to the forest growth algorithm, along with the specification of the number of trees to grow (`T`) and how deep to grow the trees (`L`):

```
> forest <- CRF.grow(Xvars, Xtype, T=200, L=2)
```

**ii) Fit the data**

The fitting step is now straightforward. The forest is given to the fitting function, together with the data matrix and the responses, either encrypted or unencrypted. The only additional option is what size of stochastic fraction estimate to use (the \(M\) parameter from the paper), passed as argument `resamp`.

To do this unencrypted, one simply runs:

```
> fit <- CRF.fit(forest, X, y, resamp=8)
```

If `X` and `y` are encrypted, then the algorithm requires the value 1 encrypted, as well as \(M\) both encrypted and in plain text, since the stochastic fraction estimate is \(M-\sum_{i=1}^M \eta_i + 1\), where the \(\eta_i\) come from `X`. Therefore, the public key must be provided so that the algorithm can produce encrypted versions of these,

```
> fit <- CRF.fit(forest, X, y, resamp=8, pk=keys$pk)
```
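As a plaintext sanity check of the stochastic fraction estimate above (the `eta` values here are made up for illustration, not package output):

```r
# Stochastic fraction estimate M - sum(eta_i) + 1 for one resample of
# size M = 8; eta holds illustrative 0/1 entries as drawn from X.
# Under encryption the identical arithmetic is performed on ciphertexts,
# which is why encrypted copies of 1 and M are needed.
M <- 8
eta <- c(1, 0, 1, 1, 0, 1, 0, 1)
frac.est <- M - sum(eta) + 1
frac.est   # 4
```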

**iii) Prediction**

Imagine now that there are \(N'\) test observations for the toy example, similarly stored in an unencrypted or encrypted \(N' \times P\) matrix, `newX`. Then prediction proceeds using the forest and fit from the previous commands:

```
> pred <- CRF.pred(forest, fit, newX)
```

For similar reasons to above, the public key is required when computing with encrypted data,

```
> pred <- CRF.pred(forest, fit, newX, pk=keys$pk)
```

Upon completion, `pred` will be an \(N' \times 2\) matrix with votes for class \(j\) on observation \(i\) being in the \((i,j)\)th entry.
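Once decrypted, this votes matrix can be turned into hard class labels by a majority vote per row. A minimal base-R sketch (the `pred` values below are made up, and the two classes are labelled 0 and 1 for illustration):

```r
# 'pred' stands in for a decrypted N' x 2 vote matrix: column j holds
# the votes for class j-1 on each test observation.
pred <- matrix(c(120,  80,
                  40, 160,
                 100, 100), ncol = 2, byrow = TRUE)

# Predicted label = column with most votes (ties broken towards class 0).
labels <- max.col(pred, ties.method = "first") - 1L
labels   # 0 1 0
```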

### Semi-parametric naïve Bayes (SNB)

There are two steps in working with the SNB classifier in the package: i) fit a model to the data, i.e., estimate the factors that make up the parameters \(\{\hat{\theta}, \hat{\alpha}, \hat{\beta}\}\); ii) predict a new observation.

The user again has the choice of working with encrypted or unencrypted data (this is automatically detected), with the proviso that both steps must use the same type of data, i.e., either both use encrypted data or both use unencrypted data.

Assume we have an \(N \times P\) design matrix `X` and a length \(N\) binary response vector `y`; and let `cX` and `cy` denote the corresponding ciphertexts, encrypted with the package `HomomorphicEncryption`.

**i) Estimation**

Besides the data there are only two parameters in the estimation step: `paired` controls whether paired or unpaired optimisation is to be used; and `pk` is the public key used to encrypt `cX` and `cy` – it is used to encrypt some constants required in the estimation process and is only needed if encrypted data is used.

Estimating the class counts \(\{\text{counts}(y=0), \text{counts}(y=1)\}\) and the coefficients \(\{a_{j}, b_{j}, d_{j}\}\) is then straightforward using the fitting function `SNB.fit`. Say \(N=10\) and \(P=2\); then, with unencrypted data we obtain

```
> snb1.fit <- SNB.fit(X, y, paired=T)
> snb1.fit
$counts.y
[1] 4 6

$coeffs
  [,1] [,2]
a  110  462
b   -8 -120
d  221  105
```

while with encrypted data we obtain

```
> snb2.fit <- SNB.fit(cX, cy, paired=T, pk=keys$pk)
> snb2.fit
$counts.y
Vector of 2 Fan and Vercauteren cipher texts
$coeffs
Matrix of 3 x 2 Fan and Vercauteren cipher texts
```

**ii) Prediction**

The function `SNB.fit` generates an object of class `SNB`, which can be fed directly into the base R `predict` function. Apart from the fitted `model` and the testing data `newX`, there are only two parameters: `type` controls the type of prediction to be returned (raw coefficients or conditional class probabilities); `sk` is the secret key – required if predictions are to be decrypted.

Continuing the example above, we can obtain the probabilities of class \(y=1\),

```
> snb2.predProb <- predict(model=snb2.fit, newX=cX, type="prob", sk=keys$sk)
> snb2.predProb
[1] 0.2212818 0.1973391 0.7221404 0.8975560 0.7293452 0.4353264 0.4621838
[8] 0.4442447 0.7364313 0.8907048
```

or obtain the class counts and coefficients \(\{e_{ij}, d_{j}\}\) for all testing observations,

```
> snb2.predRaw <- predict(model=snb2.fit, newX=cX, type="raw", sk=keys$sk)
> snb2.predRaw
$counts.y
[1] 4 6

$coeffs.e
      [,1] [,2]
 [1,]  102 -138
 [2,]   70 -138
 [3,]   86  102
 [4,]  102  222
 [5,]   94  102
 [6,]   70  -18
 [7,]   94  -18
 [8,]   78  -18
 [9,]  102  102
[10,]   86  222

$coeffs.d
[1] 221 105
```

## Citation information

If you use this package, please use the following citation:

Aslett, L. J. M., Esperança, P. M. and Holmes, C. C. (2015), Encrypted statistical machine learning: new privacy preserving methods. Technical report, University of Oxford.

```
@techreport{Aslett2015,
  Author = {L. J. M. Aslett and P. M. Esperan{\c c}a and C. C. Holmes},
  Institution = {University of Oxford},
  Title = {Encrypted statistical machine learning: new privacy preserving methods},
  Year = {2015}
}
```

## References

Aslett, L. J. M., Esperança, P. & Holmes, C. C. (2015a), ‘A review of homomorphic encryption and software tools for encrypted statistical machine learning’, *arXiv*.

Aslett, L. J. M., Esperança, P. & Holmes, C. C. (2015b), ‘Encrypted statistical machine learning: new privacy preserving methods’, *arXiv*.

Gentry, C. (2010), ‘Computing arbitrary functions of encrypted data’, *Communications of the ACM* **53**(3), 97–105.

R Core Team (2015), *R: A language and environment for statistical computing*. R Foundation for
Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/
