Commit 39c239b8 by Sijie Xiong

Added dp_stats files

# dp-stats authors
This project is primarily maintained by researchers at the Department of Electrical and Computer Engineering at Rutgers, the State University of New Jersey.
## Primary authors
* [Sijie Xiong](https://gitlab.com/u/sx37)
* [Hafiz Imtiaz](https://gitlab.com/u/hafizimtiaz)
## Additional contributors
* [Anand D. Sarwate](https://gitlab.com/u/asarwate)
* [Liyang Xie](https://sites.google.com/site/xieliyang66/)
* [Dean Coco](https://gitlab.com/u/dacoco)
The MIT License (MIT)
Copyright (c) 2016 dp-stats
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# dp-stats
dp-stats is a Python library for differentially private statistics and machine learning algorithms.
## Contact
* Subscribe to our mailing list: [dp_stats@email.rutgers.edu](https://email.rutgers.edu/mailman/listinfo/dp_stats)
## Dependencies
dp-stats has the following dependencies:
* Python 3.5
* Numpy 1.10.4
* Scipy 0.17.0
## Downloading
You can download the repository from https://gitlab.com/dp-stats/dp-stats.git, or clone it using
```
$ git clone https://gitlab.com/dp-stats/dp-stats.git
```
## Installation
To install:
```
$ cd /path/to/dp_stats
```
From that directory, run:
```
$ python setup.py install
```
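Installing with pip from the same directory should also work, since the package ships a standard `setup.py`:
```
$ pip install .
```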
To use in your programs:
```
import dp_stats as dps
```
## Contributing
This package is in alpha, so bug reports, especially regarding implementation and parameter setting, are very welcome. If you would like to become a developer, feel free to contact the authors.
Requests for additional features/algorithms are also welcome, as are requests for tutorials.
## Testing
Please run the following code to check that the installation works correctly.
```
import numpy as np
import dp_stats as dps

### example of mean and variance
x = np.random.rand(10)
x_mu = dps.dp_mean( x, 1.0, 0.1 )
x_vr = dps.dp_var( x, 1.0, 0.1 )
print(x_mu)
print(x_vr)

### example of DP-PCA
d = 10   # data dimension
n = 100  # number of samples
k = 5    # true rank

### create covariance matrix
A = np.zeros((d, d))
for i in range(d):
    if i < k:
        A[i, i] = d - i
    else:
        A[i, i] = 1
mean = np.zeros(d)  # true mean of the samples

### generate n samples, then rescale so every sample has 2-norm at most 1 (required by the DP-PCA routines)
samps = np.random.multivariate_normal(mean, A, n)  # [n x d]
samps = samps / np.max(np.linalg.norm(samps, axis=1))

sigma = np.dot(samps.T, samps)  # second-moment (scatter) matrix of the samples
U, S, V = np.linalg.svd(sigma, full_matrices=True)
U_reduce = U[:, :k]
quality = np.trace(np.dot(np.dot(U_reduce.T, A), U_reduce))
print(quality)

sigma_dp = dps.dp_pca_sn(samps.T, epsilon=0.1)
U_dp, S_dp, V_dp = np.linalg.svd(sigma_dp, full_matrices=True)
U_dp_reduce = U_dp[:, :k]
quality_dp = np.trace(np.dot(np.dot(U_dp_reduce.T, A), U_dp_reduce))
print(quality_dp)
```
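If everything is installed correctly, both examples should run without errors; the DP-PCA quality value will typically be below the non-private one, with the gap depending on the privacy parameters and the data scale.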
## License
MIT license
## Acknowledgements
Development of this package was supported by the following sources:
* National Science Foundation under award CCF-1453432
* National Institutes of Health under award 1R01DA040487-01A1
* Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center, Pacific (SSC Pacific) under contract No. N66001-15-C-4070.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA or SSC Pacific.
__author__ = 'Sijie, Hafiz'
from .dp_stats import *
from .dp_svm import *
from .dp_lr import *
import numpy as np
from scipy.optimize import minimize
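# noisevector draws a noise vector whose direction is uniform on the unit sphere and whose norm is
# drawn from Gamma(Length, 1/scale), i.e. a vector with density proportional to exp(-scale * ||b||);
# this is the noise distribution used for output and objective perturbation in Chaudhuri et al. (2011).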
def noisevector( scale, Length ):
r1 = np.random.normal(0, 1, Length)#standard normal distribution
n1 = np.linalg.norm( r1, 2 )#get the norm of this random vector
r2 = r1 / n1#the norm of r2 is 1
normn = np.random.gamma( Length, 1/scale, 1 )#Generate the norm of noise according to gamma distribution
res = r2 * normn#get the result noise vector
return res
def lr( z ):
logr = np.log( 1 + np.exp( -z ) )
return logr
def lr_output_train( data, labels, epsilon, Lambda ):
L = len( labels )
l = len( data[0] )#length of a data point
scale = L * Lambda * epsilon / 2#chaudhuri2011differentially corollary 11, part 1
noise = noisevector( scale, l )
x0 = np.zeros( l )#starting point with same length as any data point
def obj_func(x):
jfd = lr( labels[0] * np.dot( data[0] , x ) )
for i in range( 1, L ):
jfd = jfd + lr( labels[i] * np.dot( data[i], x ) )
f = (1/L) * jfd + (1/2) * Lambda * ( np.linalg.norm(x)**2 )
return f
#minimization procedure
f = minimize( obj_func, x0, method='Nelder-Mead').x#empirical risk minimization using scipy.optimize minimize function
fpriv = f + noise
return fpriv
def lr_objective_train(data, labels, epsilon, Lambda ):
#parameters in objective perturbation method
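# If the privacy budget left after accounting for the regularizer (Epsilonp) is not positive, an extra
# quadratic term (Delta/2)*||x||^2 is added to the objective and epsilon/2 is spent on the noise,
# following Algorithm 2 (objective perturbation) of Chaudhuri, Monteleoni and Sarwate (2011).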
c = 1 / 4#chaudhuri2011differentially corollary 11, part 2
L = len( labels )#number of data points in the data set
l = len( data[0] )#length of a data point
x0 = np.zeros( l )#starting point with same length as any data point
Epsilonp = epsilon - 2 * np.log( 1 + c / ( Lambda * L ) )
if Epsilonp > 0:
Delta = 0
else:
Delta = c / ( L * ( np.exp( epsilon / 4 ) - 1 ) ) - Lambda
Epsilonp = epsilon / 2
scale = Epsilonp / 2
noise = noisevector( scale, l )
def obj_func( x ):
jfd = lr( labels[0] * np.dot( data[0], x ) )
for i in range( 1, L ):
jfd = jfd + lr( labels[i] * np.dot( data[i], x ) )
f = (1/L) * jfd + (1/2) * Lambda * ( np.linalg.norm(x)**2 ) + (1/L) * np.dot(noise, x) + (1/2) * Delta * (np.linalg.norm(x)**2)
return f
#minimization procedure
fpriv = minimize(obj_func, x0, method='Nelder-Mead').x#empirical risk minimization using scipy.optimize minimize function
return fpriv
def dp_lr(data, labels, method='obj', epsilon=0.1, Lambda = 0.01 ):
'''
This function provides a differentially-private estimate of the logistic regression classifier, following
Chaudhuri, Monteleoni and Sarwate (2011), "Differentially Private Empirical Risk Minimization".
Input:
data = data matrix, samples are in rows
labels = labels of the data samples
method = 'obj' (for objective perturbation) or 'out' (for output perturbation)
epsilon = privacy parameter, default 0.1
Lambda = regularization parameter
Output:
fpriv = (\epsilon)-differentially-private estimate of the logistic regression classifier
Example:
>>> import numpy as np
>>> import dp_stats as dps
>>> n, d = 100, 10
>>> X = np.random.normal(1.0, 1.0, (n,d))
>>> Y = np.random.normal(-1.0, 1.0, (n,d))
>>> labelX = 1.0 * np.ones(n)
>>> labelY = -1.0 * np.ones(n)
>>> data = np.vstack((X,Y))
>>> labels = np.hstack((labelX,labelY))
>>> fpriv = dps.dp_lr(data, labels, 'obj', 0.1, 0.01)
[ 1.45343603 6.59613827 3.39968451 0.56048388 0.69090816 1.7477234
-1.50873385 -2.06471724 -1.55284441 4.03065254]
'''
if epsilon <= 0.0:
print('ERROR: Epsilon should be positive.')
return
else:
if method == 'obj':
fpriv = lr_objective_train(data, labels, epsilon, Lambda )
else:
fpriv = lr_output_train( data, labels, epsilon, Lambda )
return fpriv
def dp_pca_ag ( data, epsilon=1.0, delta=0.1 ):
'''
This function provides a differentially-private estimate of the second moment matrix of the data
using the Analyze Gauss method
Input:
data = data matrix, samples are in columns
epsilon = privacy parameter, default 1.0
delta = privacy parameter, default 0.1
Output:
hat_A = (\epsilon, \delta)-differentially-private estimate of A = data*data'
Example:
>>> import numpy as np
>>> import dp_stats as dps
>>> data = np.random.rand(5, 10)                        # 10 samples in columns, each 5-dimensional
>>> data = data / np.max(np.linalg.norm(data, axis=0))  # every sample 2-norm at most 1
>>> hat_A = dps.dp_pca_ag ( data, 1.0, 0.1 )
[[ 1.54704321 2.58597112 1.05587101 0.97735922 0.03357301]
[ 2.58597112 4.86708836 1.90975259 1.41030773 0.06620355]
[ 1.05587101 1.90975259 1.45824498 -0.12231379 -0.83844168]
[ 0.97735922 1.41030773 -0.12231379 1.47130207 0.91925544]
[ 0.03357301 0.06620355 -0.83844168 0.91925544 1.06881321]]
'''
import numpy as np
if any( np.diag( np.dot( data.transpose(), data ) ) > 1 ):
print('ERROR: Each column in the data matrix should have 2-norm bounded in [0,1].')
return
elif epsilon <= 0.0:
print('ERROR: Epsilon should be positive.')
return
elif delta <= 0.0 or delta > 1.0:
print('ERROR: Delta should be in (0, 1].')
return
else:
A = np.dot( data, data.transpose() )
D = ( 1.0 / epsilon ) * np.sqrt( 2.0 * np.log( 1.25 / delta ) )
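# D is the Gaussian-mechanism noise standard deviation calibrated to unit L2-sensitivity:
# sqrt(2*ln(1.25/delta)) / epsilon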
m = len(A)
temp = np.random.normal( 0, D, (m, m))
temp2 = np.triu( temp )
temp3 = temp2.transpose()
temp4 = np.tril(temp3, -1)
E = temp2 + temp4
hat_A = A + E
return hat_A
def dp_pca_sn ( data, epsilon = 1.0 ):
'''
This function provides a differentially-private estimate of the second moment matrix of the data
using the Symmetric Noise method
Input:
data = data matrix, samples are in columns
epsilon = privacy parameter, default 1.0
Output:
hat_A = (\epsilon)-differentially-private estimate of A = data*data'
Example:
>>> import numpy as np
>>> import dp_stats as dps
>>> data = np.random.rand(5, 10)                        # 10 samples in columns, each 5-dimensional
>>> data = data / np.max(np.linalg.norm(data, axis=0))  # every sample 2-norm at most 1
>>> hat_A = dps.dp_pca_sn ( data, 1.0 )
[[ 1.54704321 2.58597112 1.05587101 0.97735922 0.03357301]
[ 2.58597112 4.86708836 1.90975259 1.41030773 0.06620355]
[ 1.05587101 1.90975259 1.45824498 -0.12231379 -0.83844168]
[ 0.97735922 1.41030773 -0.12231379 1.47130207 0.91925544]
[ 0.03357301 0.06620355 -0.83844168 0.91925544 1.06881321]]
'''
import numpy as np
if any( np.diag( np.dot( data.transpose(), data ) ) > 1 ):
print('ERROR: Each column in the data matrix should have 2-norm bounded in [0,1].')
return
elif epsilon <= 0.0:
print('ERROR: Epsilon should be positive.')
return
else:
A = np.dot( data, data.transpose() )
d = len(A)
nsamples = d + 1
sigma = ( 1.0 / ( 2.0 * epsilon ) )
Z_mean = 0.0
Z = np.random.normal(Z_mean, sigma, (d,nsamples))
E = np.dot( Z, Z.transpose() )
hat_A = A + E
return hat_A
def dp_pca_ppm ( data, k, Xinit, epsilon=1.0,delta=0.1 ):
'''
This function provides a differentially-private estimate of the top-k subspace of the second moment matrix
of the data using the Private Power method
Input:
data = data matrix, samples are in columns
k = reduced dimension
Xinit = d x k size, initialization for the sampling
epsilon = privacy parameter, default 1.0
delta = privacy parameter, default 0.1
Output:
X = (\epsilon, \delta)-differentially-private estimate of the top-k subspace of A = data*data'
Example:
>>> import numpy as np
>>> import dp_stats as dps
>>> d, n, k = 5, 10, 2
>>> data = np.random.rand(d, n)                         # n samples in columns, each d-dimensional
>>> data = data / np.max(np.linalg.norm(data, axis=0))  # every sample 2-norm at most 1
>>> Xinit, _ = np.linalg.qr(np.random.randn(d, k))      # random orthonormal initialization
>>> X = dps.dp_pca_ppm ( data, k, Xinit, 1.0, 0.1 )     # X is a d x k orthonormal basis
'''
import numpy as np
if any( np.diag( np.dot( data.transpose(), data ) ) > 1 ):
print('ERROR: Each column in the data matrix should have 2-norm bounded in [0,1].')
return
elif epsilon <= 0.0:
print('ERROR: Epsilon should be positive.')
return
elif delta <= 0.0 or delta > 1.0:
print('ERROR: Delta should be in (0, 1].')
return
else:
A = np.dot( data, data.transpose() )
d = np.size( A, 0 )
U, S, V = np.linalg.svd( A )
param = S[k-1] * np.log( d ) / ( S[k-1] - S[k] )
L = round( 10 * param )
sigma = ( 1.0 / epsilon ) * np.sqrt( 4.0 * k * L * np.log( 1.0 / delta ) )
x_old = Xinit
count = 0
while count <= L:
G_new = np.random.normal( 0, np.linalg.norm( x_old, np.inf ) * sigma, (d, k))
Y = np.dot(A, x_old) + G_new
count += 1
q, r = np.linalg.qr(Y)
x_old = q[:, 0:k ]
X = x_old
return X
import numpy as np
#from statistics import stdev
# from pylab import norm
from scipy.optimize import minimize
def noisevector( scale, Length ):
r1 = np.random.normal(0, 1, Length)#standard normal distribution
n1 = np.linalg.norm( r1, 2 )#get the norm of this random vector
r2 = r1 / n1#the norm of r2 is 1
normn = np.random.gamma( Length, 1/scale, 1 )#Generate the norm of noise according to gamma distribution
res = r2 * normn#get the result noise vector
return res
def huber(z, h):#chaudhuri2011differentially corollary 21
if z > 1 + h:
hb = 0
elif np.fabs(1-z) <= h:
hb = (1 + h - z)**2 / (4 * h)
else:
hb = 1 - z
return hb
def svm_output_train(data, labels, epsilon, Lambda, h):
N = len( labels )
l = len( data[0] )#length of a data point
scale = N * Lambda * epsilon / 2
noise = noisevector( scale, l )
x0 = np.zeros(l)#starting point with same length as any data point
def obj_func(x):
jfd = huber( labels[0] * np.dot(data[0],x), h )
for i in range(1,N):
jfd = jfd + huber( labels[i] * np.dot(data[i],x), h )
f = ( 1/N )*jfd + (1/2) * Lambda * ( np.linalg.norm(x,2)**2 )
return f
#minimization procedure
f = minimize(obj_func, x0, method='Nelder-Mead').x #empirical risk minimization using scipy.optimize minimize function
fpriv = f + noise
return fpriv
def svm_objective_train(data, labels, epsilon, Lambda, h):
#parameters in objective perturbation method
c = 1 / ( 2 * h )#chaudhuri2011differentially corollary 13
N = len( labels )#number of data points in the data set
l = len( data[0] )#length of a data point
x0 = np.zeros(l)#starting point with same length as any data point
Epsilonp = epsilon - 2 * np.log( 1 + c / (Lambda * N))
if Epsilonp > 0:
Delta = 0
else:
Delta = c / ( N * (np.exp(epsilon/4)-1) ) - Lambda
Epsilonp = epsilon / 2
noise = noisevector(Epsilonp/2, l)
def obj_func(x):
jfd = huber( labels[0] * np.dot(data[0], x), h)
for i in range(1,N):
jfd = jfd + huber( labels[i] * np.dot(data[i], x), h )
f = (1/N) * jfd + (1/2) * Lambda * (np.linalg.norm(x,2)**2) + (1/N) * np.dot(noise,x) + (1/2)*Delta*(np.linalg.norm(x,2)**2)
return f
#minimization procedure
fpriv = minimize(obj_func, x0, method='Nelder-Mead').x#empirical risk minimization using scipy.optimize minimize function
return fpriv
def dp_svm(data, labels, method='obj', epsilon=0.1, Lambda = 0.01, h = 0.5):
'''
This function provides a differentially-private estimate of the SVM classifier, following
Chaudhuri, Monteleoni and Sarwate (2011), "Differentially Private Empirical Risk Minimization".
Input:
data = data matrix, samples are in rows
labels = labels of the data samples
method = 'obj' (for objective perturbation) or 'out' (for output perturbation)
epsilon = privacy parameter, default 0.1
Lambda = regularization parameter
h = huber loss parameter
Output:
fpriv = (\epsilon)-differentially-private estimate of the svm classifier
Example:
>>> import numpy as np
>>> import dp_stats as dps
>>> n, d = 100, 10
>>> X = np.random.normal(1.0, 1.0, (n,d))
>>> Y = np.random.normal(-1.0, 1.0, (n,d))
>>> labelX = 1.0 * np.ones(n)
>>> labelY = -1.0 * np.ones(n)
>>> data = np.vstack((X,Y))
>>> labels = np.hstack((labelX,labelY))
>>> fpriv = dps.dp_svm(data, labels, 'obj', 0.1, 0.01, 0.5)
[ 9.23418189 2.63380995 -2.01654661 -1.19112074 17.32083386
3.37943017 -14.76815378 12.3119061 -1.82132988 24.03559848]
'''
import numpy as np
if epsilon <= 0.0:
print('ERROR: Epsilon should be positive.')
return
else:
if method == 'obj':
fpriv = svm_objective_train(data, labels, epsilon, Lambda, h)
else:
fpriv = svm_output_train(data, labels, epsilon, Lambda, h)
return fpriv
__author__ = 'Sijie'
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A brief introduction to differential privacy\n",
"\n",
"The goal of these tutorials is to give a hands-on introduction to *differential privacy*, a framework for thinking about the privacy risks inherent when doing statistics or data analytics on private or sensitive data. Many approaches to protecting data privacy seek to \"anonymize\" the data by removing obvious (or not so obvious) identifiers. For example, a data set might have names, addresses, social security numbers, and other personally identifying information removed. However, that does not guarantee that publishing a stripped-down data set is still safe -- there have been many well-publicized attacks on supposedly \"sanitized\" data that use a small amount of auxiliary (and sometimes public) information to re-identify individuals in the data set.\n",
"\n",
"The fundamental difficulty in these examples is that the *data itself is uniquely identifying*. The follow-on implication is that if we publish the output of a program (say, a statistical analysis method) that runs on private data, we *reveal something about the individuals in the data*. The *differential privacy* model is a way to quantify this additional risk of re-identification. Privacy is a property of the *algorithm that operates on the data*; different algorithms incur different *privacy risks*. While we have a \n",
"\n",
"Differential privacy was first proposed in a paper by Dwork, McSherry, Nissim, and Smith in 2006 [DMNS06]. In the intervening years there has been a rapid growth in the research literature on differentially private approaches for many statistical, data mining, and machine learning algorithms of interest. The goal of this package is to provide easy-to-use implementations of these methods as well as tutorials (via ipython notebooks) to show how to use these methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[DMNS06]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Metadata-Version: 1.0
Name: dp-stats
Version: 0.1
Summary: Differentially private statistics
Home-page: https://gitlab.com/sx37/dp-stats.git
Author: Anand Sarwate, Hafiz Imtiaz, Sijie Xiong
Author-email: anand.sarwate@rutgers.edu
License: MIT
Description: UNKNOWN
Platform: UNKNOWN
setup.py
dp_stats/__init__.py
dp_stats/dp_lr.py
dp_stats/dp_pca.py
dp_stats/dp_stats.py
dp_stats/dp_svm.py
dp_stats/test.py
dp_stats.egg-info/PKG-INFO
dp_stats.egg-info/SOURCES.txt
dp_stats.egg-info/dependency_links.txt
dp_stats.egg-info/not-zip-safe
dp_stats.egg-info/top_level.txt