Commit b872351f by Sijie Xiong

Added ipython notebooks

parent e5dfb420
File added
......@@ -5,12 +5,33 @@
"metadata": {},
"source": [
"# A brief introduction to differential privacy\n",
"---\n",
"\n",
"The goal of these tutorials is to give a hands-on introduction to *differential privacy*, a framework for thinking about the privacy risks inherent in doing statistics or data analytics on private or sensitive data. Many approaches to protecting data privacy seek to \"anonymize\" the data by removing obvious (or not so obvious) identifiers. For example, a data set might have names, addresses, social security numbers, and other personally identifying information removed. However, removing these fields does not guarantee that the stripped-down data set is safe to publish -- there have been many well-publicized attacks on supposedly \"sanitized\" data that use a small amount of auxiliary (and sometimes public) information to re-identify individuals in the data set.\n",
"\n",
"The fundamental difficulty in these examples is that the *data itself is uniquely identifying*. The follow-on implication is that if we publish the output of a program (say, a statistical analysis method) that runs on private data, we *reveal something about the individuals in the data*. The *differential privacy* model is a way to quantify this additional risk of re-identification. Privacy is a property of the *algorithm that operates on the data*; different algorithms incur different *privacy risks*.\n",
"\n",
"Differential privacy was first proposed in a paper by Dwork, McSherry, Nissim, and Smith in 2006 [DMNS06]. In the intervening years there has been a rapid growth in the research literature on differentially private approaches for many statistical, data mining, and machine learning algorithms of interest. The goal of this package is to provide easy-to-use implementations of these methods as well as tutorials (via ipython notebooks) to show how to use these methods.\n",
"\n",
"## Definition of Differential Privacy\n",
"\n",
"An algorithm $\\mathcal{A}$ taking values in a set $\\mathcal{S}$ provides $(\\epsilon,\\delta)$-differential privacy if\n",
"$$\\text{Pr}(\\mathcal{A}(D) \\in S) \\leq e^{\\epsilon} \\text{Pr}(\\mathcal{A}(D') \\in S) + \\delta$$\n",
"for all measurable $S \\subseteq \\mathcal{S}$ and all data sets $D$ and $D'$ differing in a single entry [DR14]. \n",
"\n",
"This definition essentially states that the probability of any output event changes by at most a factor of $e^{\\epsilon}$, plus an additive slack $\\delta$, when a single entry of the input data set is changed. Here, $\\epsilon$ and $\\delta$ are privacy parameters; smaller values of $\\epsilon$ and $\\delta$ give stronger privacy. The parameter $\\delta$ is often loosely interpreted as the probability that the $\\epsilon$ guarantee fails to hold. For the same $\\epsilon$, an $(\\epsilon,0)$-differentially private algorithm therefore gives a stronger guarantee than an $(\\epsilon,\\delta)$-differentially private algorithm with $\\delta > 0$. We refer to $(\\epsilon,0)$-differential privacy as $\\epsilon$-differential privacy.\n",
"\n",
"### Why do we need differential privacy?\n",
"\n",
"Consider a database of salaries of 5 people and an algorithm $f$ that outputs the average salary of the database. Let us assume that there is an adversary who can only observe the output of the algorithm. To be more specific, consider the following collection of salaries\n",
"\n",
"$$X = [100 \\ \\ 120 \\ \\ 110 \\ \\ 130 \\ \\ 140]\\ \\ \\Rightarrow\\ \\ f(X) = 120$$\n",
"\n",
"Now, let us assume that we add another individual to our collection of salaries and that their salary is 1000. If we compute the output of the algorithm, we, along with the adversary, would observe that the average salary has increased significantly, indicating that there is a high earner in the database. \n",
"\n",
"$$ X' = [100 \\ \\ 120 \\ \\ 110 \\ \\ 130 \\ \\ 140 \\ \\ 1000]\\ \\ \\Rightarrow\\ \\ f(X') = 266.67$$\n",
"\n",
"This situation may be undesirable for the individuals in the collection. Differential privacy *modifies* the algorithm so that the effect of any single individual on its output is masked. More formally, we are interested in the *sensitivity* of the function under consideration, that is, the largest change in its output caused by changing one entry, and we add noise scaled to that sensitivity."
]
},
{
......@@ -18,8 +39,11 @@
"metadata": {},
"source": [
"## References\n",
"---\n",
"\n",
"[DMNS06] Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography. Lecture notes in computer science, Vol. 3876, eds S. Halevi and T. Rabin (Berlin, Heidelberg: Springer), 265–284.\n",
"\n",
"[DR14] Dwork, C., and Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4), 211-407."
]
},
{
......@@ -33,22 +57,23 @@
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python [Root]",
"language": "python",
"name": "python2"
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
......
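As a rough illustration of the introduction notebook's closing remark about adding noise scaled to the sensitivity, the sketch below applies the Laplace mechanism to the average of the five salaries. It does not use the `dp_stats` package. The helper name `laplace_mean`, the assumed salary bounds `[0, 200]`, and the choice of Laplace noise are all assumptions made for illustration only.

```python
import numpy as np

def laplace_mean(values, lower, upper, epsilon, rng=None):
    """Hypothetical helper: epsilon-DP mean via the Laplace mechanism."""
    rng = np.random.default_rng() if rng is None else rng
    # Clip to the assumed public bounds so that the sensitivity is finite.
    x = np.clip(np.asarray(values, dtype=float), lower, upper)
    # Changing one of the n entries moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(x)
    # Laplace noise with scale sensitivity / epsilon gives (epsilon, 0)-DP.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(x.mean() + noise)

salaries = [100, 120, 110, 130, 140]
private_avg = laplace_mean(salaries, lower=0, upper=200, epsilon=1.0,
                           rng=np.random.default_rng(0))
print("Non-private mean:", np.mean(salaries))
print("Private mean (one random draw):", private_avg)
```

The clipping step is what bounds the influence of an outlier such as the salary of 1000: it would be treated as 200 (the assumed upper bound) before averaging.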
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<script>\n",
" function code_toggle() {\n",
" if (code_shown){\n",
" $('div.input').hide('500');\n",
" $('#toggleButton').val('Show Code')\n",
" } else {\n",
" $('div.input').show('500');\n",
" $('#toggleButton').val('Hide Code')\n",
" }\n",
" code_shown = !code_shown\n",
" }\n",
"\n",
" $( document ).ready(function(){\n",
" code_shown=false;\n",
" $('div.input').hide()\n",
" });\n",
"</script>\n",
"<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%HTML\n",
"<script>\n",
" function code_toggle() {\n",
" if (code_shown){\n",
" $('div.input').hide('500');\n",
" $('#toggleButton').val('Show Code')\n",
" } else {\n",
" $('div.input').show('500');\n",
" $('#toggleButton').val('Hide Code')\n",
" }\n",
" code_shown = !code_shown\n",
" }\n",
"\n",
" $( document ).ready(function(){\n",
" code_shown=false;\n",
" $('div.input').hide()\n",
" });\n",
"</script>\n",
"<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Differentially Private Mean\n",
"---\n",
"\n",
"The following tutorial gives one example of how the `dp_mean()` function is called. The data samples are randomly drawn from a Gaussian distribution and then mapped into the range $[0, 1]$ by taking absolute values and clipping. The output of the `dp_mean()` function is compared to a non-differentially private version of the sample mean: $\\bar{x}=\\frac{1}{n}\\sum_{i=1}^{n}x_i$. \n",
"\n",
"The parameters that can be adjusted are:\n",
"\n",
"- Epsilon\n",
"- Delta\n",
"- Sample_size"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'Non-private Mean: 0.6688'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Differentially Private Mean: 0.6635'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<function __main__.show_mean>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from ipywidgets import interact\n",
"from IPython.display import display\n",
"import numpy as np\n",
"import dp_stats as dps\n",
"\n",
"# This tutorial gives an example of using the dp_mean() function\n",
"# The true sample mean and differentially private mean of the data vector will be displayed for comparison\n",
"\n",
"\n",
"# This function will allow the outputs of the means to be interactive\n",
"def show_mean(Epsilon=1.0, Delta = 0.1, Sample_size = 100):\n",
"    # generate a sample data vector\n",
"    data_ = np.random.normal(loc = 0, scale = 1.0, size = Sample_size)\n",
"    \n",
"    # restrict the data vector to be non-negative and within the range [0, 1]\n",
"    data_ = np.abs(data_)\n",
"    data_ = data_.clip(min = 0, max = 1.0)\n",
"\n",
"    # find the non-differentially private mean of the generated data\n",
"    mean_control = np.mean(data_)\n",
"    \n",
"    # find the differentially private mean of the generated data\n",
"    # dp_mean( data_vect, epsilon=1.0, delta=0.1 )\n",
"    mean_dp = dps.dp_mean(data_, epsilon = Epsilon, delta = Delta)\n",
"    \n",
"    # output the control and differentially private mean\n",
"    control_txt = 'Non-private Mean: {}'.format(round(mean_control, 4))\n",
"    display(control_txt)\n",
"    dp_txt = 'Differentially Private Mean: {}'.format(round(float(mean_dp), 4))\n",
"    display(dp_txt)\n",
"\n",
"interact(show_mean, Epsilon=(0.01,3,0.01), Delta=(0.01,0.5,0.01), Sample_size=(100,10000,500))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can be noted from the outputs that the differentially private mean tends to come closer to the actual sample mean when the sample size becomes larger at a fixed privacy level, or when the privacy guarantee becomes weaker (Epsilon becomes larger) at a fixed sample size."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<script>\n",
" $(document).ready(function(){\n",
" $('div.prompt').hide();\n",
" $('div.back-to-top').hide();\n",
" $('nav#menubar').hide();\n",
" $('.breadcrumb').hide();\n",
" $('.hidden-print').hide();\n",
" });\n",
"</script>\n",
"\n",
"<footer id=\"attribution\" style=\"float:right; color:#999; background:#fff;\">\n",
"Created with Jupyter, delivered by Fastly, rendered by Rackspace.\n",
"</footer>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%HTML\n",
"<script>\n",
" $(document).ready(function(){\n",
" $('div.prompt').hide();\n",
" $('div.back-to-top').hide();\n",
" $('nav#menubar').hide();\n",
" $('.breadcrumb').hide();\n",
" $('.hidden-print').hide();\n",
" });\n",
"</script>\n",
"\n",
"<footer id=\"attribution\" style=\"float:right; color:#999; background:#fff;\">\n",
"Created with Jupyter, delivered by Fastly, rendered by Rackspace.\n",
"</footer>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
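The trend described at the end of the mean notebook can be made concrete with a small sketch of the classical Gaussian mechanism for a mean of values in `[0, 1]`. This is not necessarily what `dp_mean()` does internally; the helper names `gaussian_sigma` and `gaussian_mean` are hypothetical, and the noise calibration used here is the standard one that is only valid for `0 < epsilon < 1`.

```python
import numpy as np

def gaussian_sigma(n, epsilon, delta):
    """Noise standard deviation for the mean of n values in [0, 1] (assumed calibration)."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("this calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    sensitivity = 1.0 / n  # changing one entry moves the mean by at most 1/n
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def gaussian_mean(data, epsilon, delta, rng=None):
    """Hypothetical (epsilon, delta)-DP mean: clip to [0, 1], then add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(data, dtype=float), 0.0, 1.0)
    return float(x.mean() + rng.normal(0.0, gaussian_sigma(len(x), epsilon, delta)))

# The noise level shrinks as the sample size or epsilon grows.
for n in (100, 1000, 10000):
    print(n, gaussian_sigma(n, epsilon=0.5, delta=0.1))
```

Because the noise standard deviation is proportional to `1 / (n * epsilon)`, a tenfold increase in either the sample size or epsilon reduces the expected error by the same factor.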
{
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<script>\n",
" function code_toggle() {\n",
" if (code_shown){\n",
" $('div.input').hide('500');\n",
" $('#toggleButton').val('Show Code')\n",
" } else {\n",
" $('div.input').show('500');\n",
" $('#toggleButton').val('Hide Code')\n",
" }\n",
" code_shown = !code_shown\n",
" }\n",
"\n",
" $( document ).ready(function(){\n",
" code_shown=false;\n",
" $('div.input').hide()\n",
" });\n",
"</script>\n",
"<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%HTML\n",
"<script>\n",
" function code_toggle() {\n",
" if (code_shown){\n",
" $('div.input').hide('500');\n",
" $('#toggleButton').val('Show Code')\n",
" } else {\n",
" $('div.input').show('500');\n",
" $('#toggleButton').val('Hide Code')\n",
" }\n",
" code_shown = !code_shown\n",
" }\n",
"\n",
" $( document ).ready(function(){\n",
" code_shown=false;\n",
" $('div.input').hide()\n",
" });\n",
"</script>\n",
"<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Differentially Private PCA\n",
"---\n",
"\n",
"The following tutorial gives one example of how the differentially private PCA function `dp_pca_sn()` is called. The data samples are randomly drawn i.i.d. from a multivariate Gaussian distribution with a pre-defined mean and covariance matrix. The quality of the output subspace (in terms of the energy of the covariance matrix captured in the reduced-dimensional subspace) is shown for both the differentially private PCA and the non-differentially private PCA as a comparison. \n",
"\n",
"The parameters that can be adjusted are:\n",
"\n",
"- Epsilon\n",
"- Sample_size\n",
"- k (dimension of the reduced subspace)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Non-private Quality: 26.9546'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Differentially Private Quality: 26.9547'"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ipywidgets import interact\n",
"from IPython.display import display\n",
"\n",
"# This tutorial gives an example of one way to use the differentially private PCA function\n",
"# A non-differentially private version of the PCA process will also be run so that the two can be compared\n",
"\n",
"\n",
"# This function will be used to randomly generate a data matrix from a multivariate Gaussian distribution\n",
"def gen_data(Sample_size, k):\n",
"    \"\"\"\n",
"    Inputs:\n",
"        Sample_size: total number of samples to return in the data matrix\n",
"        k: number of leading diagonal entries of the covariance matrix set to d - i (the rest are set to 1)\n",
"    Outputs:\n",
"        data_: data matrix, [Sample_size x d]\n",
"        A: covariance matrix, [d x d]\n",
"    \"\"\"\n",
"    \n",
"    import numpy as np\n",
"\n",
"    d = 10 # features\n",
"    n = Sample_size # number of samples to generate\n",
"\n",
"    # create covariance matrix\n",
"    A = np.zeros((d,d))\n",
"    for i in range(d):\n",
"        if i < k:\n",
"            A[i,i] = d - i\n",
"        else:\n",
"            A[i, i] = 1\n",
"\n",
"    # create mean\n",
"    mean = np.zeros(d)\n",
"\n",
"    # generate n samples\n",
"    data_ = np.random.multivariate_normal(mean, A, n) # [nxd]\n",
"\n",
"    return data_, A\n",
"\n",
"# This function will allow the PCA outputs to be interactive\n",
"def show_pca_qual(Sample_size, k = 5, Epsilon = 1.0):\n",
"    import numpy as np\n",
"    import dp_stats as dps\n",
"    \n",
"    # generate the data matrix\n",
"    data_, A = gen_data(Sample_size, k) # data_: samples are in rows, A: covariance matrix\n",
"    \n",
"    # go through the non-differentially private PCA routine\n",
"    sigma_control = np.dot(data_.T, data_) # [d x d] = [d x Sample_size] [Sample_size x d]\n",
"    U, S, V = np.linalg.svd(sigma_control)\n",
"    \n",
"    # grab the first k columns\n",
"    U_reduc = U[:, :k]\n",
"    \n",
"    # find the quality of the PCA control\n",
"    control_quality = np.trace(np.dot(np.dot(U_reduc.T, A), U_reduc))\n",
"    \n",
"    \n",
"    # go through the differentially private PCA routine\n",
"    # dp_pca_sn( data, epsilon=1.0 ); samples must be in columns\n",
"    sigma_dp = dps.dp_pca_sn(data_.T, epsilon = Epsilon)\n",
"    U_dp, S_dp, V_dp = np.linalg.svd(sigma_dp)\n",
"    \n",
"    # grab the first k columns\n",
"    U_dp_reduc = U_dp[:, :k]\n",
"    \n",
"    # find the quality of the differentially private PCA method\n",
"    dp_quality = np.trace(np.dot(np.dot(U_dp_reduc.T, A), U_dp_reduc))\n",
"    \n",
"    # output the results\n",
"    control_txt = \"Non-private Quality: {}\".format(round(control_quality, 4))\n",
"    display(control_txt)\n",
"    dp_txt = \"Differentially Private Quality: {}\".format(round(float(dp_quality), 4))\n",
"    display(dp_txt)\n",
"\n",
"interact(show_pca_qual, Sample_size=(50,1000,100), k=(1, 10, 1), Epsilon=(0.01,3.0,0.01))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<script>\n",
" $(document).ready(function(){\n",
" $('div.prompt').hide();\n",
" $('div.back-to-top').hide();\n",
" $('nav#menubar').hide();\n",
" $('.breadcrumb').hide();\n",
" $('.hidden-print').hide();\n",
" });\n",
"</script>\n",
"\n",
"<footer id=\"attribution\" style=\"float:right; color:#999; background:#fff;\">\n",
"Created with Jupyter, delivered by Fastly, rendered by Rackspace.\n",
"</footer>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%HTML\n",
"<script>\n",
" $(document).ready(function(){\n",
" $('div.prompt').hide();\n",
" $('div.back-to-top').hide();\n",
" $('nav#menubar').hide();\n",
" $('.breadcrumb').hide();\n",
" $('.hidden-print').hide();\n",
" });\n",
"</script>\n",
"\n",
"<footer id=\"attribution\" style=\"float:right; color:#999; background:#fff;\">\n",
"Created with Jupyter, delivered by Fastly, rendered by Rackspace.\n",
"</footer>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
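The quality measure used in the PCA notebook, the trace of `U_k.T A U_k`, can be checked in isolation with plain NumPy. The sketch below rebuilds the notebook's diagonal covariance matrix for `d = 10` and `k = 5` and compares the energy captured by the true top-`k` eigenvectors with that of a random `k`-dimensional subspace. The helper name `captured_energy` and the random subspace are illustrative assumptions, and `dp_stats` is not used.

```python
import numpy as np

def captured_energy(U_k, A):
    """Energy of covariance A captured by the orthonormal columns of U_k."""
    return float(np.trace(U_k.T @ A @ U_k))

d, k = 10, 5
# Same covariance structure as gen_data(): d - i on the first k diagonal entries, 1 elsewhere.
A = np.diag([float(d - i) if i < k else 1.0 for i in range(d)])

# For a diagonal A with decreasing leading entries, the top-k eigenvectors
# are the first k standard basis vectors.
U_best = np.eye(d)[:, :k]
best_quality = captured_energy(U_best, A)  # 10 + 9 + 8 + 7 + 6

# A random k-dimensional subspace (orthonormal basis via QR) for comparison.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
random_quality = captured_energy(Q, A)

print("Best possible quality:", best_quality)
print("Random subspace quality:", random_quality)
```

The best possible value is the sum of the `k` largest eigenvalues of `A`, so any estimated subspace, private or not, can be judged by how close its quality comes to that ceiling.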
This source diff could not be displayed because it is too large. You can view the blob instead.