
Essay: Privacy, Accuracy and Cost in Record Matching with Hybrid Technique

Published: 1 April 2019
Last Modified: 23 July 2024



Abstract

Real-world entities are not always represented by the same set of features in different data sets. Therefore, matching records of the same real-world entity distributed across these data sets is a challenging task. If the data sets contain private information, the problem becomes even more difficult. Existing solutions to this problem generally follow two approaches: sanitization techniques and cryptographic techniques. This project proposes a hybrid technique that combines these two approaches and enables users to trade off between privacy, accuracy and cost. The project's main contribution is the use of a blocking phase that operates over sanitized data to filter out, in a privacy-preserving manner, pairs of records that do not satisfy the matching condition. This method incurs considerably lower costs than existing cryptographic techniques and yields significantly more accurate matching results than existing sanitization techniques, even when privacy requirements are high.

1. INTRODUCTION

Analysis and integration of information maintained by distinct entities is critical for many applications. For instance, two competitor businesses may wish to share information about customers with similar demographics (e.g., age, zip code) to increase their revenues (e.g., to jointly support a location-customized service for young subscribers). However, to protect their customer base, both parties want to keep data that are not part of the join result private.

1.1 DATA MINING

Data mining is a recently emerging field, connecting the three worlds of Databases, Artificial Intelligence and Statistics. The information age has enabled many organizations to gather large volumes of data. However, the usefulness of this data is negligible if “meaningful information” or “knowledge” cannot be extracted from it. Data mining, otherwise known as knowledge discovery, attempts to answer this need. In contrast to standard statistical methods, data mining techniques search for interesting information without demanding a priori hypotheses. As a field, it has introduced new concepts and algorithms such as association rule learning. It has also applied known machine-learning algorithms such as inductive-rule learning (e.g., by decision trees) to the setting where very large databases are involved. Data mining techniques are used in business and research and are becoming more and more popular with time.

1.2 CONFIDENTIALITY ISSUES IN DATA MINING

A key problem that arises during collection of data is to maintain its confidentiality. The need for privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business interests. However, there are situations where the sharing of data can lead to mutual gain. A key utility of large databases today is research, whether it be scientific, or economic and market oriented. Thus, for example, the medical field has much to gain by pooling data for research; as can even competing businesses with mutual interests. Despite the potential gain, this is often not possible due to the confidentiality issues which arise. Addressing this issue, it can be shown that highly efficient solutions are possible.

1.3 PRIVACY PRESERVING MECHANISM

The goal of data mining is to build models of real data. The problem, however, is that real data are often too valuable or sensitive to obtain. The solution is to add privacy protection to the data, so that only information that is strictly necessary is published to other parties; for example, parties may learn only the average values of entries.

The goal is to match records that represent the same (or similar) real-world individuals across data sets; because shared unique identifiers are typically unavailable, matching cannot rely on them. This problem is known as the record matching problem. Since record matching is a key component of data integration methodologies, it has been investigated extensively. However, especially after the introduction of powerful data mining techniques, privacy concerns related to sharing of individual information have pushed research toward the reformulation of the problem and the development of new solutions, introducing the concept of private record matching.

Private matching is a challenging problem, as in many situations uniquely identifying information may not be available, and matching is performed based on attributes like age, occupation, etc. Furthermore, such information may not always be completely consistent across data sets (e.g., the weight of a patient may vary between two admissions to different hospitals). Therefore, it is important to devise methods that are capable of privately matching records through a distance-based condition, rather than simple equi-joins computed using cryptographic hashes.

Two main approaches have been proposed for private matching: sanitization methods that perturb private information to obscure individual identity, and cryptographic methods that rely on Secure Multi-party Computation (SMC) protocols. Sanitization techniques such as k-anonymization or random noise addition involve a delicate trade-off between accuracy and privacy: higher levels of protection translate into further deviation from the original data and consequently less accurate results. Cryptographic techniques do not sacrifice accuracy to achieve privacy. In general, the algorithms applied to private data are converted to complex binary circuits with private inputs. Then, using SMC protocols, accurate results are obtained. SMC protocols guarantee that only the final result, and any information that can be inferred from the final result and the input, is revealed. Such protocols have certain security parameters, like encryption key sizes, that allow users to trade off between cost and privacy.

Compared to SMC, sanitization methods have two limitations: the level of privacy protection may not be sufficient, and the matching result may exhibit false positives or negatives. On the other hand, SMC methods provide strong privacy and perfect accuracy, but incur prohibitive computational and communication costs. All existing SMC techniques require O(n × m) cryptographic operations, where n and m are the numbers of records in the first and second data set, respectively. If n = m = 10,000, such an integration task will require 10^8 cryptographic operations. The cost of each individual operation is very high, and grows with the number of compared attributes. Consequently, none of these techniques provides a solution addressing all relevant application requirements with respect to privacy, cost, and accuracy.

2. LITERATURE SURVEY

2.1 MULTIDIMENSIONAL K-ANONYMITY

K-Anonymity has been proposed as a mechanism for protecting privacy in microdata publishing, and numerous recoding “models” have been considered for achieving k-anonymity. A new multidimensional model is proposed which provides an additional degree of flexibility. Often this flexibility leads to higher-quality anonymizations, as measured both by general-purpose metrics and by more specific notions of query answerability. Optimal multidimensional anonymization is NP-hard, like previous optimal k-anonymity problems. However, a simple greedy approximation algorithm was introduced, and experimental results show that it frequently produces more desirable anonymizations than exhaustive optimal algorithms for two single-dimensional models.

A number of organizations publish microdata for purposes such as demographic and public health research. In order to protect individual privacy, known identifiers (e.g., Name and Social Security Number) must be removed. In addition, this process must account for the possibility of combining certain other attributes with external data to uniquely identify individuals. For example, an individual might be “re-identified” by joining the released data with another (public) database on Age, Sex, and Zipcode.

K-anonymity has been proposed to reduce the risk of this type of attack. The primary goal of k-anonymization is to protect the privacy of the individuals to whom the data pertains. However, subject to this constraint, it is important that the released data remain as “useful” as possible. Numerous recoding models have been proposed in the literature for k-anonymization, and often the “quality” of the published data is dictated by the model that is used. The greedy algorithm for multidimensional k-anonymization has several important advantages: it is substantially more efficient than the proposed optimal k-anonymization algorithms for single-dimensional models; its time complexity is O(n log n), while the optimal algorithms are exponential in the worst case; and it often produces higher-quality results than optimal single-dimensional algorithms.
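The greedy multidimensional recoding described above can be sketched as a median-split recursion. This is a simplified, hypothetical illustration in the spirit of such algorithms: it assumes numeric quasi-identifiers and stops splitting whenever a cut would leave fewer than k records on either side.

```python
def mondrian_partition(records, k):
    """Greedy multidimensional partitioning: recursively split on the
    attribute with the widest value spread at its median, as long as
    both halves keep at least k records (a sketch of the greedy idea)."""
    result = []

    def split(part):
        # Pick the dimension with the largest value spread.
        dims = range(len(part[0]))
        dim = max(dims, key=lambda i: max(r[i] for r in part) - min(r[i] for r in part))
        values = sorted(r[dim] for r in part)
        median = values[len(values) // 2]
        left = [r for r in part if r[dim] < median]
        right = [r for r in part if r[dim] >= median]
        if len(left) >= k and len(right) >= k:
            split(left)
            split(right)
        else:
            result.append(part)  # no allowable cut: emit as an equivalence class

    split(records)
    return result

# Four (age, zipcode) records; every output group has at least k = 2 members,
# so each individual is indistinguishable within its equivalence class.
groups = mondrian_partition([(25, 47906), (27, 47907), (52, 46204), (55, 46208)], k=2)
```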

2.2 PRIVACY PRESERVING DATAMINING

The field of privacy preserving data mining primarily focuses on performing useful data analysis in such a way as to mitigate the risk of releasing some private or secret information. On the surface, there are two distinct sets of problems in this field. The first set includes problems of how two or more separate parties each with private data, may compute some function of the union of their data without having to reveal it. The second set focuses on how to determine whether the result of a computation alone constitutes an invasion of privacy, and if so how to mitigate the release. When two parties need to link their private data and then perform some computation on the resulting linked records, both facets of PPDM are important to respect.

2.3 SECURE MULTIPARTY COMPUTATION

Suppose two parties each hold a separate piece of private data which they would benefit from jointly analyzing. For example, the parties may be administrators of hospitals or government agencies, who are bound by law to not disclose the information of individuals in their databases. Nevertheless they may wish to join their data to that of some medical research center or another agency in order to fit a statistical model to the union of their data. Performing such computations is the concern of a mature area in the PPDM literature called “Secure Multi-party Computation” (SMC). The goal is to develop protocols consisting of local computations by individual parties and the transmission of messages between the parties.

Depending on the demands of the parties involved, one of several models of security may be appropriate. Perhaps the most well studied and rigorous formulation of a secure computation comes from cryptography. The idea is that the protocol should reveal no more information than would a fanciful “idealized” method in which the private data are presented to a completely trusted third party, who performs the computation and returns the results to each of the original parties. That is, to any specific party, the computation itself should reveal no more than whatever may be revealed by examining its input and output.

An example of a protocol that fails to meet this criterion is one in which a single party is sent all the private inputs, performs the computation locally, and then broadcasts the results to the other parties. This fails because, in general, the party who does the computation cannot infer the others’ data just from looking at its own data and the result, and so the messages passed in this protocol have revealed too much to it. If it is understood that the parties will follow the protocol, but will try to covertly infer whatever they can from the messages, then this is called the “semi-honest” or “honest but curious” model. Using techniques from cryptography, it is theoretically possible to take a protocol for the semi-honest model and make it work under a malicious model, in which one of the parties tries to deviate from the protocol in order to reveal information. Generally though, when the task is performed on joint data, it seems likely that both parties benefit from the collaboration, and hence the semi-honest model may be a reasonable assumption.

2.4 PRIVATE RECORD LINKAGE WITH BLOOM FILTERS

In many record linkage applications, identifiers have to be encrypted to preserve privacy. Therefore, a method for approximate string comparison in private record linkage is needed. A new method of approximate string comparison in private record linkage has been described. The main idea is to store q-grams sets derived from identifier values in Bloom filters and compare them bitwise across databases. This exploits the cryptographic features of Bloom filters while nevertheless allowing the calculation of string similarities. This method compares quite well to evaluating string comparison functions with plain text values of identifiers.
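The Bloom-filter comparison idea can be sketched as follows. The salted SHA-1 hashing and the parameter choices here (100-bit filters, four hash functions, bigrams) are illustrative stand-ins, not the keyed hash functions that the data holders of a real deployment would agree on in advance.

```python
import hashlib

def bloom_encode(name, q=2, m=100, num_hashes=4):
    """Encode a string's q-gram set into an m-bit Bloom filter.
    Salted SHA-1 stands in for the agreed keyed hash functions."""
    padded = f"_{name.lower()}_"
    qgrams = {padded[i:i + q] for i in range(len(padded) - q + 1)}
    bits = [0] * m
    for g in qgrams:
        for salt in range(num_hashes):
            digest = hashlib.sha1(f"{salt}:{g}".encode()).hexdigest()
            bits[int(digest, 16) % m] = 1
    return bits

def dice_similarity(a, b):
    """Dice coefficient over set bits: 2|A ∩ B| / (|A| + |B|)."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

# Similar names share most q-grams, hence most set bits:
s = dice_similarity(bloom_encode("smith"), bloom_encode("smyth"))
d = dice_similarity(bloom_encode("smith"), bloom_encode("jones"))
```

Because the comparison happens bitwise on the encoded filters, the parties never exchange plain-text identifier values.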

2.5 PRIVACY-PRESERVING SET OPERATIONS

In many important applications, a collection of mutually distrustful parties must perform private computation over multisets. Each party’s input to the function is his private input multiset. In order to protect these private sets, the players perform privacy-preserving computation; that is, no party learns more information about other parties’ private input sets than what can be deduced from the result. By employing the mathematical properties of polynomials, a framework was built that provides efficient, secure, and composable multiset operations: union, intersection, and element reduction.

2.6 BLOCKING AWARE PRIVATE RECORD LINKAGE

The problem of quickly matching records from two autonomous sources without compromising the privacy of either party is considered. In particular, the focus is on devising a secure blocking scheme that improves the performance of record linkage significantly while remaining secure. Although there has been work on private record linkage, none of it has adopted the blocking framework. Blocking-aware private record linkage can perform large-scale record linkage without violating privacy.

2.7 ANONYMIZATION

Prior to public release of a data set, unique identifiers such as social security numbers are removed to protect individual privacy. Sweeney shows that this measure is not sufficient, because quasi-identifier attributes can be combined with public directories to accurately identify individuals. Anonymization is one popular solution against such attacks. By generalizing the values of quasi-identifying attributes and/or removing entire records from the data set, anonymization methods try to satisfy certain definitions of anonymity. The most well known of such definitions is k-anonymity, which requires every combination of quasi-identifier values, called an equivalence class, to appear at least k times in the anonymized data set, so that an individual is indistinguishable within a group of size at least k. This model has been extended by many related works in the area such as l-diversity and t-closeness.

2.8 DIFFERENTIAL PRIVACY

It has been shown that every privacy protection mechanism is vulnerable to some kind of background knowledge. Instead of tailoring privacy definitions against different types of background knowledge, one should minimize the risk of disclosure that arises from participation in a database. This notion is captured by the differential privacy protection mechanism, which addresses the case of statistical databases where users are only allowed to ask aggregate queries. Differential privacy requires random noise to be added to each query result. The magnitude of the noise depends on the privacy parameter ε and the sensitivity of the query set Q. Denoting the response to query Q over data set T by QT, sensitivity is defined as follows:

Definition 1 (L1-sensitivity [30]). Over any two data sets T1, T2 such that T1 and T2 differ in only one record, the L1-sensitivity of the query set Q = {Q1, . . . , Qq} is measured as

SL1(Q) = maxT1,T2 ∑i=1..q | QiT1 – QiT2 |

Theorem 1 gives a sufficient condition for a statistical database to satisfy differential privacy. Theorem 1. Let Q be a set of queries answered by a statistical database, and denote by SL1(Q) the L1-sensitivity of Q. Then, differential privacy with parameter ε can be achieved by adding to each query result random noise X drawn from a Laplace distribution with mean 0 and scale SL1(Q)/ε.
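A minimal sketch of this noise-addition mechanism for a single counting query, whose L1-sensitivity is 1; the data and ε value below are made up for illustration.

```python
import math
import random

def laplace_noise(scale):
    """Draw Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer a counting query with ε-differential privacy.
    A count has L1-sensitivity 1 (adding or removing one record
    changes it by at most 1), so the Laplace scale is 1/ε."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 31, 45, 52, 67, 29]
noisy = private_count(ages, lambda a: a < 40, epsilon=0.5)  # true count 3 plus Laplace(0, 2) noise
```

Smaller ε means larger noise and stronger privacy, which is exactly the privacy/accuracy dial the text describes.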

2.9 PRIVATE RECORD MATCHING

Record matching has been studied for more than four decades, yet few methods for private record matching have been investigated. Most studies in the field focus on private matching of string attributes (e.g., names and addresses); the focus here is rather on numerical and categorical attributes. Closely related to this work, Al-Lawati et al. propose a secure blocking scheme to reduce costs. That approach has the disadvantage of working only for a specific comparison function. Also, as its focus is mainly on efficiency, the effectiveness of the approach has not been assessed.

Several approaches have investigated the secure set intersection problem. Such methods deal with exact matching and are too expensive to be applied to large databases due to their heavy reliance on cryptography. Agrawal et al. formalize a notion of private information sharing across databases that relies on commutative encryption techniques, leading to several protocols.

Record matching is the process of identifying record pairs, across two input data sets, that correspond to similar (or the same) real-world entities. In essence, the problem consists of building a classifier that accurately classifies pairs of records as “match” or “nonmatch.” In the private record matching problem, an accurate classifier is assumed to be available. Therefore, private record matching methods focus on classifying all record pairs within the input data sets privately, accurately, and efficiently. We consider a matching scenario with three participants. These are data holder parties A and B with data sets T and V, respectively, and a querying party QP that provides the classifier for identification of matching record pairs. In a real-world application, A and B could be hospitals and QP a researcher trying to match patients with similar characteristics such as geographical location, age and sex.

Without loss of generality, let T and V be represented as relations. Let us also assume that these relations have the same schema, T(A1, . . . , Ad) and V(A1, . . . , Ad); if not, the schemas of T and V can be matched using private schema matching techniques. Given matching thresholds θi ≥ 0 and distance functions di : Dom(T.Ai) × Dom(V.Ai) → R+, defined over the domains of corresponding attributes of T and V, record matching can be expressed as a join of T and V. For t ∈ T and v ∈ V, the join condition is a decision rule DR that returns true if di(t.Ai, v.Ai) ≤ θi for all attributes (1 ≤ i ≤ d) and false otherwise. Formally,

DR(t, v) = true, if di(t.Ai, v.Ai) ≤ θi for all 1 ≤ i ≤ d
           false, otherwise

Our task is to evaluate the decision rule in a privacy-preserving manner such that the result is available to the querying party QP, while private records of the data holders that do not satisfy the join condition are not disclosed.
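The decision rule DR transcribes directly into code; the attribute values, distance functions and thresholds below are illustrative.

```python
def decision_rule(t, v, dist_fns, thresholds):
    """DR(t, v): true iff every attribute distance is within its
    matching threshold θi (a direct transcription of the join condition)."""
    return all(d(ta, va) <= theta
               for d, theta, ta, va in zip(dist_fns, thresholds, t, v))

abs_diff = lambda x, y: abs(x - y)

# Two patient records compared on (age, weight) with thresholds (2, 5):
match = decision_rule((34, 70), (35, 68), [abs_diff, abs_diff], (2, 5))     # within both thresholds
no_match = decision_rule((34, 70), (40, 68), [abs_diff, abs_diff], (2, 5))  # age differs by 6 > 2
```

The challenge addressed by the rest of the project is evaluating this rule without either data holder seeing the other's raw attribute values.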

Two main approaches have been proposed for private matching.

1. Sanitization methods

2. Cryptographic methods

There are many limitations in the existing methods:

• The cost of each individual operation is very high, and grows with the number of compared attributes. Consequently, none of these techniques are able to provide a solution addressing all relevant application requirements with respect to privacy, cost, and accuracy.

• Sanitization techniques such as k- anonymization or random noise addition involve a delicate trade-off between accuracy and privacy: higher levels of protection translate into further deviation from the original data and consequently less accurate results.

• Cryptographic techniques do not sacrifice accuracy to achieve privacy, but attain this only at a high computational and communication cost.

• Compared to SMC, sanitization methods have two limitations: the level of privacy protection may not be sufficient and the matching result may exhibit false positives/negatives. On the other hand, SMC methods provide strong privacy and perfect accuracy, but incur prohibitive computational and communication cost.

3. SYSTEM DESCRIPTION

3.1 HYBRID APPROACH

A novel method is proposed to address private record matching by combining cryptographic and sanitization techniques.

The proposed private matching technique consists of three phases:

1. Partitioning: Each data holder independently partitions its records according to some privacy-preserving mechanism (e.g., k-anonymization, ε-differential privacy). The outcome is a set of smaller partitions whose extents are hyper-rectangles in the multidimensional attribute space.

2. Blocking: All pairs of partitions from the data holders are input to a blocking decision rule. By looking at the regions covered by the partitions, the blocking decision rule outputs either match, non-match, or unknown. Only records within pairs of partitions labelled unknown are input to the costly SMC step.

3. SMC: Pairs of records that are still not labelled are matched using cryptographic protocols, and matching record pairs are added to the result. If the input data sets are very large, significant numbers of record pairs may have to be labelled using cryptographic techniques. In such cases, since the cost of the private record matching process is not known in advance, data holder parties might be unwilling to participate. That is why limiting the costs of the cryptographic techniques is considered; further, the cost-accuracy and cost-privacy relationships are also analysed.

    Fig. 3.1 Overview of the Hybrid Model

When the upper bound imposed on SMC costs is too low, some record pairs might remain unlabelled at the end of SMC. In order not to reveal irrelevant pairs, they are labelled as non matches. This precaution degrades recall since some of those unlabelled record pairs might actually be matching. Fortunately, based on the sanitized views output at the end of the partitioning step, pairs that are more likely to match can be given priority during the SMC step.

The hybrid approach combines sanitization methods with cryptographic methods in three steps. The first step, partitioning, produces sanitized views of the input data sets through perturbation. The second step is the blocking step, where pairs of partitions produced in the partitioning step are compared against one another based on the regions covered by each partition. The third step, namely the SMC step, labels any pairs of records that were not classified as match or nonmatch in the blocking step.

3.2 PARTITIONING STEP

A partition p consists of a set of points, Points(p), and a d-dimensional hyper-rectangle Region(p) such that t ∈ Points(p) ⇒ t ∈ Region(p). In other words, every point in Points(p) is contained in the region of partition p. The interval covered by a region r on dimension Ai is denoted [xi, yi], where xi is the lower bound on attribute Ai and yi is the upper bound.

Given a data set D, a partitioning algorithm outputs a set of partitions PD = {p1, . . . , pk}. The focus is primarily on space partitioning algorithms that cover the entire data space.
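A partition as defined above might be represented like this; the Partition name and its fields are a hypothetical sketch, not identifiers from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """A partition p: its points plus the hyper-rectangle Region(p),
    stored as one [lower, upper] interval per dimension."""
    region: list                       # [(x1, y1), ..., (xd, yd)]
    points: list = field(default_factory=list)

    def contains(self, t):
        """Check the invariant t ∈ Points(p) ⇒ t ∈ Region(p)."""
        return all(x <= ti <= y for ti, (x, y) in zip(t, self.region))

# A 2-D partition over (age, weight):
p = Partition(region=[(20, 30), (60, 80)])
p.points.append((25, 70))
inside = p.contains((25, 70))  # 20 <= 25 <= 30 and 60 <= 70 <= 80
```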

3.3 BLOCKING STEP

Given two regions R1 and R2, let diinf(R1, R2) denote the infimum distance between any pair of records within R1 and R2 over the ith dimension. Formally,

diinf(R1, R2) = inf t∈R1, v∈R2 di(t, v)

By definition, diinf(R1, R2) is the greatest lower bound on the distance. If diinf(R1, R2) > θi for some 1 ≤ i ≤ d, then no two points from R1 and R2 can match. The supremum distance is defined similarly:

disup(R1, R2) = sup t∈R1, v∈R2 di(t, v)

By definition, disup(R1, R2) bounds from above the maximum distance between two arbitrary points of R1 and R2. If disup(R1, R2) ≤ θi for every attribute 1 ≤ i ≤ d, then all pairs of points within R1 × R2 must match.

Based on the infimum and supremum distance functions, the blocking decision rule BDR(R1, R2) can be defined as

BDR(R1, R2) = N, if diinf(R1, R2) > θi for some 1 ≤ i ≤ d
              M, if disup(R1, R2) ≤ θi for all 1 ≤ i ≤ d
              U, otherwise

Here, the return values M, N and U refer to match, nonmatch, and unknown, respectively. Not all pairs of regions can be classified as M or N; whenever an accurate decision cannot be drawn, the pair is labelled U. Records in such regions will be labelled privately by SMC protocols.
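For axis-aligned intervals the infimum and supremum distances have closed forms, which makes the blocking decision rule cheap to evaluate. A sketch, assuming each per-dimension distance is the absolute difference:

```python
def interval_dist_inf(a, b):
    """Infimum of |t - v| over t in interval a, v in interval b
    (0 whenever the intervals overlap)."""
    (ax, ay), (bx, by) = a, b
    return max(ax - by, bx - ay, 0)

def interval_dist_sup(a, b):
    """Supremum of |t - v| over the two intervals (attained at endpoints)."""
    (ax, ay), (bx, by) = a, b
    return max(ay - bx, by - ax)

def bdr(r1, r2, thresholds):
    """Blocking decision rule over two hyper-rectangles:
    'N' if some dimension's infimum distance exceeds θi (no pair can match),
    'M' if every dimension's supremum distance is within θi (all pairs match),
    'U' otherwise (left for the SMC step)."""
    if any(interval_dist_inf(a, b) > t for a, b, t in zip(r1, r2, thresholds)):
        return 'N'
    if all(interval_dist_sup(a, b) <= t for a, b, t in zip(r1, r2, thresholds)):
        return 'M'
    return 'U'

# 1-D sanity checks with threshold θ = 5:
labels = [bdr([(0, 2)], [(20, 30)], [5]),   # gap of 18 exceeds θ
          bdr([(0, 2)], [(1, 4)], [5]),     # even the farthest pair is within θ
          bdr([(0, 10)], [(8, 20)], [5])]   # some pairs may match, some cannot
```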

3.4 OVERALL PROTOCOL FOR BLOCKING

Let {Ti}, 1 ≤ i ≤ m (respectively {Vj}, 1 ≤ j ≤ n) be the set of partitions extracted from data set T (respectively V). The algorithm below describes the overall protocol for the blocking step. For every pair of partitions Ti of T and Vj of V, the blocking decision rule BDR is evaluated. In step 3, record pairs that will be labelled with SMC protocols are identified. Step 6 inserts matching record pairs into the result set.

Protocol for the blocking step

Require: partitions {Ti}, 1 ≤ i ≤ m, with T = ∪ Ti, and {Vj}, 1 ≤ j ≤ n, with V = ∪ Vj
1: for all partitions Ti ∈ T do
2:   for all partitions Vj ∈ V do
3:     if BDR(Region(Ti), Region(Vj)) = U then
4:       Privately match Points(Ti) × Points(Vj)
5:     else if BDR(Region(Ti), Region(Vj)) = M then
6:       Add Points(Ti) × Points(Vj) to the result
7:     end if
8:   end for
9: end for

Assuming that step 6 only marks the pair (Ti, Vj) as M and that step 4 is performed in the SMC step, the algorithm terminates in O(m × n) time.
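The protocol above can be sketched as a double loop; bdr and smc_match are passed in as stand-ins for the blocking decision rule and the cryptographic matching protocol, and the toy implementations below are made up for the demo.

```python
from itertools import product

def blocking_step(parts_T, parts_V, bdr, smc_match):
    """Evaluate the blocking decision rule on every pair of partitions.
    'M' pairs go straight into the result, 'U' pairs are forwarded to the
    costly SMC matcher, and 'N' pairs are discarded. Each partition is a
    (region, points) tuple."""
    result = []
    for (rt, pts_t), (rv, pts_v) in product(parts_T, parts_V):
        label = bdr(rt, rv)
        if label == 'M':
            result.extend(product(pts_t, pts_v))    # all cross pairs match
        elif label == 'U':
            result.extend(smc_match(pts_t, pts_v))  # resolve privately
    return result

# Toy 1-D demo with threshold θ = 5; toy_smc stands in for the real protocol.
theta = 5

def toy_bdr(r1, r2):
    (ax, ay), (bx, by) = r1[0], r2[0]
    if max(ax - by, bx - ay, 0) > theta:
        return 'N'
    return 'M' if max(ay - bx, by - ax) <= theta else 'U'

def toy_smc(ts, vs):
    return [(t, v) for t in ts for v in vs if abs(t[0] - v[0]) <= theta]

T = [([(0, 3)], [(1,), (2,)]), ([(50, 60)], [(55,)])]
V = [([(2, 6)], [(4,)])]
pairs = blocking_step(T, V, toy_bdr, toy_smc)
```

In the demo, the partition at [50, 60] is pruned outright by the blocking rule, so none of its records ever reach the expensive matching phase.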

3.5 SMC STEP

Considering each partition as a small data set by itself, any existing solution for privacy-preserving record matching can be applied to match the set of non-blocked partition pairs. In classical SMC protocols, using some cryptographic assumptions, it can be proven that only the final results, and anything that could be inferred by looking at the final results, are revealed. This method provides security guarantees that are slightly different from those of generic SMC protocols: it is implicitly assumed that disclosure of the output of the privacy-preserving partitioning algorithms does not violate privacy. This is reflected in the privacy definition, where the goal is to reveal only the final record matching result, the privacy-preserving partitioning of the data sets, and anything that can be inferred from the result and the partitioned data sets. Since the blocking step depends only on pairs of partitions, it satisfies this goal; in other words, anything revealed during the blocking step could be inferred from the partitioned data sets.

3.5.1 BASIC SMC PROTOCOL FOR RECORD MATCHING

For each pair of records that is not blocked, there is a need to securely learn whether the pair actually matches or not. In other words, for each possibly matching record pair and for each attribute, we need to securely determine whether di(t.Ai, v.Ai) ≤ θi is satisfied. Such a secure calculation is possible using generic SMC circuit evaluation techniques. Recently, many protocols have also been proposed using special encryption functions such as commutative encryption and homomorphic encryption.

Either these protocols, or any other SMC technique that can securely compute d(t, v), could be used in the SMC step.

3.5.2 LIMITED SMC BUDGET

Efficiency of a blocking scheme is measured by the reduction ratio (RR) metric. Given a baseline comparison space S, the reduction ratio is the fraction of savings from the comparison space attained by the blocking scheme. The results are compared to the benchmark solution that privately evaluates DR over all record pairs in the Cartesian product T × V. Therefore, |S| = |T × V| = |T| × |V|.

Then,

RR = 1 – (number of secure decision rule evaluations) / (|T| × |V|)

When the input data sets T and V are large, even after considerable reduction in the comparison space, the cost of applying our solutions might be higher than the amount anticipated by the participants. In order to prevent concerns related to high costs from hampering the record matching process, an extension to the methods is discussed where participants can determine an upper bound on the number of SMC protocol invocations, called the SMC budget.

Similar to RR, we represent the SMC budget as a fraction of the Cartesian product size:

SMC budget = (max. number of secure decision rule evaluations) / (|T| × |V|)

For example, if |T| = |V| = 10^3, then an SMC budget of 0.01 implies DR will be evaluated at most 10^6 × 0.01 = 10^4 times using the SMC protocols. The number of record pairs that were not labelled after the blocking step, and hence must be labelled in the SMC step, is (1 – RR) × |T| × |V|. If 1 – RR ≤ SMC budget, then there is no challenge in enforcing the limits over SMC operations, because the budget meets or exceeds the need. However, when 1 – RR > SMC budget, some record pairs cannot be properly labelled.
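The budget arithmetic from the example can be checked directly; the RR value of 0.95 is an assumed figure for illustration.

```python
# |T| = |V| = 10^3, as in the example above.
n_T = n_V = 1000
cartesian = n_T * n_V                          # 10^6 candidate pairs
smc_budget = 0.01                              # fraction of the Cartesian product
max_smc_evals = int(cartesian * smc_budget)    # at most 10^4 secure DR evaluations

# Assume blocking achieved RR = 0.95, leaving 5% of pairs unlabelled.
rr = 0.95
unlabelled = round((1 - rr) * cartesian)       # 5 * 10^4 pairs still need SMC
budget_sufficient = (1 - rr) <= smc_budget     # False: some pairs stay unlabelled
```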

In order to prevent disclosure of irrelevant record pairs, we assume that all such records are excluded from the result set (i.e., assumed to be non-matching record pairs). Whenever the SMC budget is insufficient, record pairs should be chosen carefully to maximize the number of matching record pairs found in the SMC step. This notion is captured by the recall measure. Let H be some heuristic that guides the selection of the record pairs toward which the SMC budget is spent. Then, the recall of H, denoted RecallH, is the fraction of matching record pairs that H can identify in the SMC step. Formally, denoting the set of matching record pairs by nM, the recall of heuristic H is

RecallH = (number of matching pairs found by H) / |nM|

A naive approach to enforcing the SMC budget would be choosing a random subset of unlabelled record pairs. Yet, it makes more sense to use the information contained in the partition regions. Below we discuss various heuristics that help identify possibly matching record pairs; among these, the heuristic with the maximum recall should be favoured. The heuristics rank pairs of partitions, and in the SMC step, pairs are processed according to these ranks. If the SMC budget is low, then low-ranked pairs may be excluded and automatically labelled as “non-match”. The heuristics are outlined below, and an empirical evaluation of them is provided.

H1—Minimum comparison cost first. In this heuristic, partitions of data set T are sorted with respect to the number of secure DR evaluations required to find all matching records of V. Then, the partitions are processed in ascending order. The idea is to maximize the fraction of records of T that are matched against V. H1 would be especially advantageous if the partitions were weighted based on some criteria; for example, partitions that contain individuals of a certain age group may be given priority over others.

H2—Minimum volume partition first. In this heuristic, partitions p of T are sorted with respect to the volume of their regions, Region(p). Then, partitions are processed in ascending order. Considering records as random variables supported over their partition regions, this heuristic assumes that lower volumes imply less uncertainty in estimating the actual value of a record. Based on this idea, partitions with the smallest regions are processed first.

H3—Partition pair(p1, p2) with maximum Region(p1) ∩ Region(p2) volume first. This heuristic assumes the volume of the intersection between partition regions is an accurate indicator of possibly matching records. Therefore, pairs are ordered based on normalized intersection volumes and processed in descending order.
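Heuristic H3 can be sketched as follows; the normalization of intersection volumes mentioned above is omitted for brevity, and the example regions are made up.

```python
def intersection_volume(r1, r2):
    """Volume of the overlap of two hyper-rectangles (0 if disjoint)."""
    vol = 1.0
    for (ax, ay), (bx, by) in zip(r1, r2):
        overlap = min(ay, by) - max(ax, bx)
        if overlap <= 0:
            return 0.0
        vol *= overlap
    return vol

def rank_pairs_h3(pairs):
    """H3: order 'unknown' partition pairs by intersection volume,
    largest first, so a limited SMC budget is spent on the pairs
    most likely to contain matches."""
    return sorted(pairs, key=lambda p: intersection_volume(*p), reverse=True)

pairs = [
    ([(0, 10), (0, 10)], [(9, 20), (9, 20)]),   # 1 x 1 overlap, volume 1
    ([(0, 10), (0, 10)], [(5, 15), (5, 15)]),   # 5 x 5 overlap, volume 25
]
ranked = rank_pairs_h3(pairs)                   # the larger overlap comes first
```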

3.6 PRIVACY DEFINITION

Privacy guarantees of SMC techniques can be proven under reasonable assumptions. We believe that a similar theoretical framework is needed for our hybrid approach. To this end, we extend the basic definitions and techniques used in SMC so that they apply to our hybrid framework. It is focussed on security/privacy definitions of the semihonest model, where each party reveals some sanitized information about its data.

In the semihonest model, a computation is secure if each party's view during protocol execution can be effectively simulated from its input and output. This does not imply that all private information is protected: under this definition, disclosure of any information that can be deduced from the final result is not a violation.

We extend the basic semihonest model by including sanitized data in the form of anonymized data sets or statistical query results that satisfy differential privacy definitions. We assume that such sanitized data are public and accessible by all participants. Formally, let ā = (a1, . . . , az) be the sanitized data of all the parties.

Let f : ({0, 1}*)^z → ({0, 1}*)^z be a probabilistic, polynomial-time functionality, where fi(x1, x2, . . . , xz) denotes the ith component of f(x1, x2, . . . , xz), and let Π be a z-party protocol for computing f. For I = {i1, i2, . . . , it} ⊆ [z], where [z] denotes the set {1, 2, . . . , z}, we let fI(x1, x2, . . . , xz) denote the subsequence fi1(x1, x2, . . . , xz), fi2(x1, x2, . . . , xz), . . . , fit(x1, x2, . . . , xz). Let the view of the ith party during an execution of protocol Π on x = (x1, x2, . . . , xz), denoted view_i^Π(x), be (xi, ri, mi1, . . . , mit), where ri represents the outcome of the ith party's internal coin tosses and mij represents the jth message received by the ith party. Also, given I = {i1, i2, . . . , it}, we let view_I^Π(x) denote the subsequence (I, view_{i1}^Π(x), . . . , view_{it}^Π(x)).
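For reference, the simulation-based condition these definitions build toward can be stated as follows. This is the standard semihonest formulation extended with the public sanitized data ā, written here as a hedged reconstruction rather than a quotation from the text:

```latex
% Protocol \Pi privately computes f in the presence of sanitized data \bar{a}
% if, for every coalition I \subseteq [z], there exists a probabilistic
% polynomial-time simulator S_I such that
\bigl\{\, S_I\bigl(x_I,\, f_I(x),\, \bar{a}\bigr) \,\bigr\}_{x}
\;\stackrel{c}{\equiv}\;
\bigl\{\, \mathrm{view}^{\Pi}_{I}(x) \,\bigr\}_{x}
% where \stackrel{c}{\equiv} denotes computational indistinguishability.
```

Intuitively, whatever a coalition sees during the protocol, it could have generated on its own from its inputs, its outputs, and the already-public sanitized data.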

In this context, three parties want to compute the record matching function f(T, V, DR), where data set T (respectively V) is the input of the first party (respectively the second party) and the decision rule DR is the input of the QP. The outputs are defined as f1(T, V, DR) = f2(T, V, DR) = ϕ and f3(T, V, DR) = DR(T, V) (i.e., the set of matched record pairs). In addition, let ā be the union of the sanitized data released during the blocking step. The protocol privately computes the record matching function f(T, V, DR) if the view of every coalition of parties can be effectively simulated from the coalition's inputs and outputs together with ā. Compared to existing privacy definitions in the semihonest model, we assume that all sanitized data (e.g., anonymized data or differentially private statistical query results) are available to any coalition of parties. The objective of the privacy-preserving protocol is to reveal nothing more than what can be inferred from all sanitized information, the original inputs of the colluding parties, and the final function result (here, the set of matching record pairs). In contrast to classic SMC models, the hybrid model can easily trade off privacy versus efficiency. If no sanitized data are revealed (i.e., ā = ϕ), this model is equivalent to SMC models. On the other hand, by revealing sanitized data, it is possible to improve the efficiency of SMC protocols without sacrificing accuracy.

3.7 ADVANTAGES OF THE PROPOSED SYSTEM

This hybrid approach has several advantages over existing methods, which can be summarized as follows:

• Costs are usually much lower than, and at worst equal to, the costs of existing cryptographic techniques.

• Record pairs marked as "non-match" by the decision rule are fully protected against disclosure.

• Recall varies with the upper bound on SMC costs imposed by participants.

• This method allows participants to trade-off between accuracy, privacy, and costs.

4. CONCLUSION AND FUTURE WORK

In this work, a novel approach is proposed that combines sanitization methods and cryptographic methods to solve the private record matching problem. Our method allows participants to trade off accuracy, privacy, and costs. Empirical analysis of the proposed methods on real-world data indicates that the hybrid approach attains significant cost savings even at considerably high levels of privacy protection. The hybrid approach thus allows us to compare two different data sets that include alphanumeric characters.

A promising area of future research is extending the idea of hybrid approaches to other privacy-preserving data mining tasks. We believe that the hybrid approach could provide substantial performance improvements for privacy-preserving distributed data mining protocols.
