Differential privacy

Differential privacy aims to provide means to maximize the accuracy of queries to statistical databases while minimizing the chances of identifying individual records.

Situation

Consider a trusted party that holds a dataset of sensitive information (e.g. medical records, voter registration information, email usage) and wants to make global, statistical information about the data publicly available, while preserving the privacy of the users whose information the dataset contains. Such a system is called a statistical database. The notion of indistinguishability,[1] later termed differential privacy,[2] formalizes the notion of "privacy" in statistical databases.

ε-differential privacy

The actions of the trusted server are modeled via a randomized algorithm \mathcal{A}. A randomized algorithm \mathcal{A} gives \epsilon-differential privacy if, for all pairs of datasets D_{1} and D_{2} that differ on a single element,

\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon)\times\Pr[\mathcal{A}(D_{2})\in S]

for all S\subseteq \mathrm{Range}(\mathcal{A}), where \mathrm{Range}(\mathcal{A}) denotes the output range of the algorithm \mathcal{A}.

N.B.: Differential Privacy is a condition on the release mechanism and not on the dataset.

This means that for any two datasets which are close to one another (that is, which differ on a single element), a given differentially private algorithm \mathcal{A} will behave approximately the same on both datasets. The definition gives a strong guarantee that the presence or absence of an individual will not affect the final output of the query significantly.
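
As an illustrative reading of the bound (with the arbitrary choice \epsilon=\ln 2): for every set of outputs S, adding or removing a single record can change the probability that the mechanism lands in S by at most a factor of two,

\Pr[\mathcal{A}(D_{1})\in S]\leq 2\times\Pr[\mathcal{A}(D_{2})\in S]

Smaller values of \epsilon give a tighter bound and hence stronger privacy, at the cost of noisier answers.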

For example, assume we have a database of medical records D_{1} where each record is a pair (Name, X), where X\in\{0,1\} denotes whether a person has diabetes or not:

Name      Has Diabetes (X)
Ross      1
Monica    1
Joey      0
Phoebe    0
Chandler  1

Now suppose a malicious user (often termed an adversary) wants to find out whether Chandler has diabetes or not. As side information, he knows in which row of the database Chandler resides. Suppose the adversary is only allowed to use a particular form of query Q(i), which returns the partial sum of the first i rows of column X in the database. In order to find Chandler's diabetes status the adversary simply executes Q(5)-Q(4). One striking feature of this example is that individual information can be compromised even without explicitly querying for the specific individual's information.
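
This differencing attack is easy to reproduce in code. The following Python sketch (the table and the helper name partial_sum are illustrative, not part of any standard library) implements the exact query Q(i) and shows how Q(5)-Q(4) reveals Chandler's entry:

    # Differencing attack on an exact partial-sum query (illustrative sketch).
    records = [("Ross", 1), ("Monica", 1), ("Joey", 0), ("Phoebe", 0), ("Chandler", 1)]

    def partial_sum(i):
        """Q(i): exact sum of column X over the first i rows."""
        return sum(x for _, x in records[:i])

    # The adversary knows Chandler is in row 5 and never queries him directly.
    print(partial_sum(5) - partial_sum(4))  # prints 1: Chandler's status is revealed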

Let us take this example a little further. Now we construct D_{2} by replacing (Chandler, 1) with (Chandler, 0). Call the release mechanism (which releases the output of Q(i)) \mathcal{A}. We say \mathcal{A} is \epsilon-differentially private if it satisfies the definition above, where S can be thought of as a singleton set (something like \{3.5\} or \{4\}) if the output of \mathcal{A} is a discrete random variable (i.e. has a probability mass function (pmf)); if instead it is a continuous random variable (i.e. has a probability density function (pdf)), then S can be thought of as a small range of reals (something like 3.5\leq\mathcal{A}(D_{1})\leq3.7).

In essence, if such an \mathcal{A} exists, then a particular individual's presence or absence in the database will not alter the distribution of the output of the query by a significant amount, and thus the privacy of individual information is assured in an information-theoretic sense.

Motivation

In the past, various ad hoc approaches to anonymizing public records have failed when researchers managed to identify personal information by linking two or more separately innocuous databases. Two well-known instances of successful "linkage attacks" are the Netflix Prize database and the Massachusetts Group Insurance Commission (GIC) medical encounter database.

Netflix Prize

Netflix offered a $1,000,000 prize for a 10% improvement in its recommendation system, and released a training dataset for the competing developers to train their systems. While releasing this dataset they provided a disclaimer: to protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly assigned ids. Netflix is not the only movie-rating portal on the web; there are many others, including IMDb, where individuals can also register and rate movies, and moreover have the option of not keeping their details anonymous. Narayanan and Shmatikov cleverly linked the anonymized Netflix training database with the IMDb database (using the dates on which users rated movies) to partly de-anonymize the Netflix training database.[3] Thus the individual information of users was clearly compromised.

Massachusetts Group Insurance Commission (GIC) medical encounter database

In this case[2] Latanya Sweeney from Carnegie Mellon University linked the anonymized GIC database (which retained the birth date, sex, and ZIP code of each patient) with voter registration records to identify the medical record of the governor of Massachusetts.

Sensitivity

Returning to the main discussion of differential privacy, the sensitivity[1] (\Delta f) of a function f:\mathcal{D}\rightarrow\mathbb{R}^{d} is

\Delta f=\max_{D_1,D_2}\|f(D_1)-f(D_2)\|_{1}

for all D_1,D_2\in\mathcal{D} differing in at most one element.

To get more intuition, let us return to the example of the medical database and a query Q(i) (which can also be seen as the function f) that counts how many people in the first i rows of the database have diabetes. Clearly, if we change one of the entries in the database, then the output of the query Q(i) will change by at most one. So the sensitivity of this query is one. It so happens that there are techniques (described below) with which we can create a differentially private algorithm for functions with low sensitivity.
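
To make the sensitivity computation concrete, here is a minimal Python sketch (with a hypothetical five-row column X) that flips one entry and checks that the count query never moves by more than one:

    # Sensitivity of the counting query Q(i) (illustrative sketch).
    d1 = [1, 1, 0, 0, 1]   # original column X
    d2 = [1, 1, 0, 0, 0]   # neighboring dataset: the last entry changed

    def q(i, column):
        """Q(i): number of ones among the first i rows."""
        return sum(column[:i])

    # The largest change over all prefix lengths is 1, so the sensitivity is 1.
    print(max(abs(q(i, d1) - q(i, d2)) for i in range(1, 6)))  # prints 1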

Laplace noise

Many differentially private algorithms rely on adding controlled noise[1] to functions with low sensitivity. We will elaborate this point using a particular kind of noise whose kernel is a Laplace distribution, i.e. with probability density function \text{noise}(y)\propto \exp(-|y|/\lambda), mean zero and scale \lambda. We define the output function of \mathcal{A} as a real-valued function (called the transcript output by \mathcal{A}) \mathcal{T}_{\mathcal{A}}(x)=f(x)+Y, where Y \sim \text{Lap}(\lambda) and f is the original real-valued query/function we plan to execute on the database. Now \mathcal{T}_{\mathcal{A}}(x) can be considered a continuous random variable, where

\frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_1}(x)=t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_2}(x)=t)}=\frac{\text{noise}(t-f(D_1))}{\text{noise}(t-f(D_2))}

which is at most e^{\frac{|f(D_{1})-f(D_{2})|}{\lambda}}\leq e^{\frac{\Delta f}{\lambda}}. We can consider \frac{\Delta f}{\lambda} to be the privacy factor \epsilon. Thus \mathcal{T} is a differentially private mechanism (as can be seen from the definition above). If we apply this to our diabetes example, it follows from the fact derived above that in order for \mathcal{A} to be \epsilon-differentially private we need \lambda=1/\epsilon. Though we have used Laplace noise here, other forms of noise, such as Gaussian noise, can also be used to create a differentially private mechanism, though they require a slight relaxation of the definition of differential privacy.[2]
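
A minimal sketch of this Laplace mechanism in Python follows; the function name is made up for this example, and numpy's Laplace sampler is parameterized directly by the scale \lambda:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        """Return f(x) + Y with Y ~ Lap(lambda), where lambda = sensitivity / epsilon."""
        scale = sensitivity / epsilon
        return true_answer + np.random.laplace(loc=0.0, scale=scale)

    # Diabetes example: Q(i) is a count with sensitivity 1, so lambda = 1/epsilon.
    column_x = [1, 1, 0, 0, 1]
    epsilon = 0.5
    noisy_q5 = laplace_mechanism(sum(column_x[:5]), sensitivity=1, epsilon=epsilon)
    noisy_q4 = laplace_mechanism(sum(column_x[:4]), sensitivity=1, epsilon=epsilon)
    # The difference is now noisy, so the earlier differencing attack no longer
    # reveals the fifth individual's value with certainty.
    print(noisy_q5 - noisy_q4)

Note that the attack is not eliminated outright: repeating the query many times and averaging would erode the protection, which is exactly what the composition results below quantify.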

Composability

If we query an \epsilon-differentially private mechanism t times, the result is \epsilon t-differentially private.[4]

More generally, if there are n mechanisms \mathcal{M}_1,\dots,\mathcal{M}_n that are \epsilon_1,\dots,\epsilon_n-differentially private, respectively, then any function of them, g(\mathcal{M}_1,\dots,\mathcal{M}_n), is \left(\sum\limits_{i=1}^{n} \epsilon_i\right)-differentially private.
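
In practice this is used as a privacy budget. The following sketch, assuming the illustrative Laplace mechanism from above with independent noise per query, splits a total budget \epsilon evenly over several count queries so that their composition stays \epsilon-differentially private:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        return true_answer + np.random.laplace(0.0, sensitivity / epsilon)

    column_x = [1, 1, 0, 0, 1]
    total_budget = 1.0            # overall epsilon we are willing to spend
    queries = [3, 4, 5]           # answer Q(3), Q(4), Q(5)
    per_query_eps = total_budget / len(queries)

    # Each answer is (total_budget/3)-differentially private; by composition the
    # three answers together are total_budget-differentially private.
    print([laplace_mechanism(sum(column_x[:i]), 1, per_query_eps) for i in queries])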

Group privacy

In general, \epsilon-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can tell whether one particular participant submitted his information. The guarantee can also be extended to protect databases differing in c rows, i.e. to prevent an adversary with arbitrary auxiliary information from telling whether c particular participants submitted their information. This can be achieved because if c items change, the probability dilation is bounded by \exp(\epsilon c) instead of \exp(\epsilon);[2] i.e. for D_1 and D_2 differing on c items:

\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon c)\times\Pr[\mathcal{A}(D_{2})\in S]

Thus setting \epsilon instead to \epsilon/c achieves the desired result (protection of c items). In other words, instead of each item being protected by \epsilon-differential privacy, every group of c items is now protected by \epsilon-differential privacy (and each item is protected by (\epsilon/c)-differential privacy).
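
With the Laplace mechanism sketched earlier, protecting every group of c records therefore amounts to scaling the noise by c (an illustrative Python fragment; the function name is made up):

    import numpy as np

    def laplace_mechanism_group(true_answer, sensitivity, epsilon, c):
        """Noise scaled so that every group of c records gets epsilon-DP protection."""
        scale = c * sensitivity / epsilon   # same as running at epsilon/c per record
        return true_answer + np.random.laplace(0.0, scale)

    # Count query with sensitivity 1, epsilon = 1, protecting groups of size 3.
    print(laplace_mechanism_group(42, sensitivity=1, epsilon=1.0, c=3))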

Proof idea

For three datasets D_1, D_2, and D_3, such that D_1 and D_2 differ on one item, and D_2 and D_3 differ on one item (so that, implicitly, D_1 and D_3 differ on at most two items), the following holds for an \epsilon-differentially private mechanism \mathcal{A}:

\Pr[\mathcal{A}(D_{1})\in S] \leq \exp(\epsilon)\times\Pr[\mathcal{A}(D_{2})\in S], and \Pr[\mathcal{A}(D_{2})\in S] \leq \exp(\epsilon)\times\Pr[\mathcal{A}(D_{3})\in S]

hence:

\Pr[\mathcal{A}(D_{1})\in S] \leq \exp(\epsilon)\times\left(\exp(\epsilon)\times\Pr[\mathcal{A}(D_{3})\in S]\right) = \exp(2\epsilon)\times\Pr[\mathcal{A}(D_{3})\in S]

The proof can be extended from 2 to general c, as shown below.
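
Explicitly, chaining c such inequalities over datasets D_1, D_2, \dots, D_{c+1}, where consecutive datasets differ in one item, gives

\Pr[\mathcal{A}(D_{1})\in S] \leq \exp(\epsilon)\times\Pr[\mathcal{A}(D_{2})\in S] \leq \dots \leq \exp(c\epsilon)\times\Pr[\mathcal{A}(D_{c+1})\in S]

which is the group-privacy bound stated above, with D_{c+1} playing the role of the dataset differing from D_{1} in c items.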

See also

Exponential mechanism (differential privacy) – a technique for designing differentially private algorithms

Notes

  1. Dwork, McSherry, Nissim and Smith, 2006.
  2. Dwork, ICALP 2006.
  3. Arvind Narayanan, Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets. In IEEE Symposium on Security and Privacy 2008, pp. 111–125.
  4. Theorem 1 in: Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M. Our Data, Ourselves: Privacy via Distributed Noise Generation. In Advances in Cryptology (EUROCRYPT 2006), 2006.

References

  • Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference (TCC), Springer, 2006.
  • Cynthia Dwork. Differential Privacy. In International Colloquium on Automata, Languages and Programming (ICALP), 2006, pp. 1–12.
