Privacy Preserving In Data Stream Using Sliding Window Method
Ankit Jasoliya, Tejal Patel.
Department of Information & Technology, PIET, Parul University, Vadodara, India.
Assistant Professor, Department of Information & Technology, PIET, Parul University, Vadodara, India.
1. INTRODUCTION
Data mining is defined as extracting information from huge sets of data; in other words, it is the procedure of mining knowledge from data. There is a huge amount of data available in the information industry, but this data is of no use until it is converted into useful information. It is therefore necessary to analyse this huge amount of data and extract useful information from it. Extraction of information is not the only process involved; data mining also comprises processes such as Data Cleaning, Data Integration, Data Transformation, Pattern Evaluation and Data Presentation.
Information is today probably the most important and most demanded resource. We live in an interconnected society that relies on the dissemination and sharing of information in the private as well as the public and governmental sectors. Governmental, public and private institutions are increasingly required to make their data electronically available. Hence there is a need to protect the privacy of the respondents (individuals, organizations, associations, business establishments, and so on).
A data stream is an unbounded sequence of real-time data items arriving at a very high rate, which can be read only once by an application. Imagine a satellite-mounted remote sensor that is constantly generating data. The data are massive (e.g. terabytes in volume), temporally ordered, fast changing, and potentially infinite. These features cause challenging problems in the data stream field. Data stream mining refers to extracting informational structure, in the form of models and patterns, from continuous data streams. Data streams raise challenges in many aspects, such as computation, storage, querying and mining.
2. LITERATURE SURVEY
The literature review is done to gain in-depth knowledge of the basics of privacy preserving data mining. It is necessary to identify the various approaches and techniques that could possibly be used to preserve sensitive data. The objective of the literature review is to identify existing privacy preserving techniques, their pros and cons, and to find an efficient approach for preserving private or sensitive data.
2.1 Anonymization based PPDM
The basic form of the data in a table consists of the following four types of attributes:
(i) Explicit Identifiers: a set of attributes containing information that identifies a record owner explicitly, such as name, SS number etc.
(ii) Quasi Identifiers: a set of attributes that could potentially identify a record owner when combined with publicly available data.
(iii) Sensitive Attributes: a set of attributes that contains sensitive person-specific information, such as disease, salary etc.
(iv) Non-Sensitive Attributes: a set of attributes that creates no problem if revealed, even to untrustworthy parties.
Anonymization refers to an approach in which the identity and/or sensitive data about record owners are hidden, while it is assumed that the sensitive data must be retained for analysis. It is obvious that explicit identifiers should be removed, but there is still a danger of privacy intrusion when quasi identifiers are linked to publicly available data. Such attacks are called linking attacks. For example, attributes such as DOB, Sex, Race and Zip are available in public records such as a voter list.
Figure 1: Linking Attack
The same attributes are also available in medical records; when linked, they can be used to infer the identity of the corresponding individual with high probability, as shown in Figure 1.
The sensitive data in a medical record is the disease or even the medication prescribed. The quasi-identifiers like DOB, Sex, Race, Zip etc. are present both in medical records and in the publicly available voter list. The explicit identifiers like Name, SS number etc. have been removed from the medical records.
Still, the identity of an individual can be predicted with high probability. Sweeney proposed the k-anonymity model, using generalization and suppression to achieve k-anonymity, i.e. any individual is indistinguishable from at least k-1 others with respect to the quasi-identifier attributes in the anonymized dataset. In other words, a table is k-anonymous if the QI values of each row are identical to those of at least k-1 other rows. Replacing a value with a less specific but semantically consistent value is called generalization, while suppression involves blocking the values altogether. Releasing such data for mining reduces the risk of identification when it is combined with publicly available data, but at the same time the accuracy of applications on the transformed data is reduced. A number of algorithms have been proposed in recent years to implement k-anonymity using generalization and suppression.
Although the anonymization method ensures that the transformed data is truthful, it suffers heavy information loss. Moreover, in practice it is not immune to homogeneity attacks and background-knowledge attacks. The limitations of the k-anonymity model stem from two assumptions. First, it may be very hard for the owner of a database to decide which of the attributes are or are not available in external tables. Second, the k-anonymity model assumes a certain method of attack, while in real situations there is no reason why the attacker should not try other methods. Nevertheless, as a research direction, k-anonymity in combination with other privacy preserving methods needs to be investigated for detecting and even blocking k-anonymity violations.
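The k-anonymity condition described above can be checked in a few lines of Python. This is an illustrative sketch; the function name and the toy generalized records are not from the paper:

```python
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical generalized records: (Age range, Zip prefix, Disease); QI = first two columns
records = [
    ("20-29", "394***", "Flu"),
    ("20-29", "394***", "Cancer"),
    ("30-39", "395***", "Flu"),
    ("30-39", "395***", "Asthma"),
]
print(is_k_anonymous(records, qi_indices=(0, 1), k=2))  # → True: each QI group has 2 rows
```

A generalization algorithm would coarsen the QI columns until this check passes for the chosen k.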
2.2 Perturbation Based PPDM
Perturbation has long been used in statistical disclosure control owing to its intrinsic properties of simplicity, efficiency and ability to preserve statistical information. In perturbation, the original values are replaced with synthetic values so that the statistical information computed from the perturbed data does not differ greatly from that computed from the original data. The perturbed records no longer correspond to real-world record holders, so an attacker cannot perform meaningful linkages or recover sensitive knowledge from the available data. Perturbation can be done using additive noise, data swapping or synthetic data generation.
In the perturbation approach, any distribution-based data mining algorithm works under the implicit assumption that each dimension can be treated independently. However, information relevant to data mining algorithms such as classification often remains hidden in inter-attribute correlations. Because the perturbation approach treats different attributes independently, distribution-based data mining algorithms suffer an intrinsic disadvantage: the loss of hidden information available in multidimensional records. Cryptographic techniques form another branch of privacy preserving data mining that addresses these disadvantages.
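A minimal sketch of additive-noise perturbation in Python, assuming zero-mean Gaussian noise (one common choice); the function and parameter names are illustrative:

```python
import random
import statistics

def perturb_additive(values, sigma, seed=None):
    """Replace each value with value + zero-mean Gaussian noise of scale sigma.
    Individual values are masked, but the aggregate mean is approximately preserved."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

original = [100.0] * 10000
noisy = perturb_additive(original, sigma=10.0, seed=42)
# The aggregate mean barely shifts even though every individual value is distorted.
print(abs(statistics.mean(noisy) - 100.0) < 1.0)  # → True
```

Note that this treats the column in isolation, which is exactly the independence assumption criticized above.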
2.3 Randomized Response Based PPDM
Basically, randomized response is a statistical technique introduced by Warner to solve a survey problem. In randomized response, the data is scrambled in such a way that the central collector cannot tell, with probability better than a predefined threshold, whether the data from a respondent is correct or incorrect. Although the information received from each single user is scrambled, if the number of users is large, the aggregate information of these users can be estimated with a good degree of accuracy. This is very valuable for decision-tree classification, which is based on aggregate values of a dataset rather than individual data items. The data collection process in the randomization method is carried out in two steps. In the first step, the data providers randomize their data and transfer the randomized data to the data receiver. In the second step, the data receiver rebuilds the original distribution of the data using a distribution reconstruction algorithm. The randomized response model is shown in Figure 2.
Figure 2: Randomization Response Model
The randomization method is relatively simple and does not require knowledge of the distribution of other records in the data. Hence, it can be implemented at data collection time and does not require a trusted server to hold all the original records in order to perform the anonymization process. The weakness of a randomized response based PPDM technique is that it treats all records equally, irrespective of their local density. This leads to a problem where outlier records become more susceptible to adversarial attacks than records in denser regions of the data. One remedy is to add correspondingly more noise to all the records in the data, but this reduces the utility of the data for mining purposes, as the reconstructed distribution may no longer yield results that serve the purpose of the data mining task.
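The two-step collect-and-reconstruct process can be illustrated with Warner's binary model in Python; the estimator below is Warner's standard formula, and all names and sample sizes are illustrative:

```python
import random

def randomized_response(true_bits, p, seed=None):
    """Step 1: each respondent reports the true bit with probability p,
    the opposite bit otherwise."""
    rng = random.Random(seed)
    return [b if rng.random() < p else 1 - b for b in true_bits]

def reconstruct_proportion(reported, p):
    """Step 2: recover the true proportion of 1s from the noisy reports
    using Warner's unbiased estimator (lambda + p - 1) / (2p - 1)."""
    lam = sum(reported) / len(reported)  # observed proportion of 1s
    return (lam + p - 1) / (2 * p - 1)

true_bits = [1] * 6000 + [0] * 14000     # true proportion of 1s is 0.30
reported = randomized_response(true_bits, p=0.8, seed=1)
estimate = reconstruct_proportion(reported, p=0.8)
```

With a large number of respondents the estimate lands close to 0.30, even though no single report can be trusted.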
2.4 Condensation approach based PPDM
The condensation approach constructs constrained clusters in the dataset and then generates pseudo-data from the statistics of these clusters. It is called condensation because it uses condensed statistics of the clusters to generate the pseudo-data. It creates groups of varying size from the data such that each record is guaranteed to lie in a group whose size is at least equal to its anonymity level. Pseudo-data are then generated from each group so as to create a synthetic dataset with the same aggregate distribution as the original data. This approach can be effectively used for the classification problem. The use of pseudo-data provides an additional layer of protection, as it becomes difficult to perform adversarial attacks on synthetic data. Moreover, the aggregate behaviour of the data is preserved, making it useful for a variety of data mining problems. This method offers better privacy preservation than other techniques because it uses pseudo-data rather than modified data, and it works without redesigning data mining algorithms since the pseudo-data has the same format as the original data. It is very effective for data stream problems where the data is highly dynamic. At the same time, data mining results are affected, because a large amount of information is lost through the condensation of many records into a single statistical group entity.
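A rough Python sketch of the condensation idea, under simplifying assumptions not taken from the paper (a single numeric attribute, groups formed by sorting, Gaussian pseudo-data drawn from each group's mean and standard deviation):

```python
import random
import statistics

def condense(values, k, seed=None):
    """Group sorted values into clusters of size >= k, then draw pseudo-data
    from each cluster's mean and standard deviation."""
    rng = random.Random(seed)
    values = sorted(values)
    groups = [values[i:i + k] for i in range(0, len(values), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())  # merge an undersized tail group
    pseudo = []
    for g in groups:
        mu = statistics.mean(g)
        sd = statistics.pstdev(g)
        pseudo.extend(rng.gauss(mu, sd) for _ in g)
    return pseudo

salaries = [23400, 25000, 34000, 34500, 28000, 31000]
synthetic = condense(salaries, k=3, seed=7)
```

The synthetic values have the same count and a similar aggregate distribution as the originals, yet no value corresponds to a real record holder.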
2.5 Cryptography Based PPDM
Consider a scenario where multiple medical institutions wish to conduct joint research for mutual benefit without revealing unnecessary information. Research regarding symptoms, diagnosis and medication based on various parameters is to be conducted while, at the same time, the privacy of the individuals is protected. Such scenarios are referred to as distributed computing scenarios. The parties involved in such mining tasks may be mutually untrusted parties or competitors; therefore protecting privacy becomes a major concern. Cryptographic techniques are ideally suited to such scenarios, where multiple parties collaborate to compute results or share non-sensitive mining results while avoiding disclosure of sensitive information. Cryptographic techniques are useful here for two reasons. First, they offer a well-defined model of privacy that includes methods for proving and quantifying it. Second, a vast set of cryptographic algorithms and constructs for implementing privacy preserving data mining algorithms is available in this domain. The data may be distributed among different collaborators vertically or horizontally.
Most of these methods are based on a special encryption protocol known as Secure Multiparty Computation (SMC). SMC, as used in distributed privacy preserving data mining, consists of a set of secure sub-protocols applied to horizontally and vertically partitioned data: secure sum, secure set union, secure size of set intersection, and scalar product. Although cryptographic techniques ensure that the transformed data is exact and secure, this approach fails to scale when more than a few parties are involved. Moreover, the data mining results themselves may breach the privacy of individual records. A good number of solutions exist for semi-honest models, but far fewer studies address malicious models.
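As one concrete sub-protocol, the secure sum mentioned above can be sketched in Python. This is a simplified single-process simulation of the ring-based protocol, not a networked implementation:

```python
import random

def secure_sum(private_values, modulus=1 << 32, seed=None):
    """Ring-based secure sum sketch: the initiating party adds a random mask,
    each party adds its private value modulo m, and the initiator finally
    removes the mask. No party sees another party's raw value, only a masked
    running total."""
    rng = random.Random(seed)
    mask = rng.randrange(modulus)
    running = mask
    for v in private_values:          # each party updates the running total in turn
        running = (running + v) % modulus
    return (running - mask) % modulus  # initiator strips the mask

print(secure_sum([120, 45, 310]))  # → 475, whatever random mask was chosen
```

The result is exact as long as the true sum is below the modulus, which is why cryptographic PPDM preserves accuracy where perturbation does not.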
3. PROPOSED APPROACH
Many techniques exist for privacy preserving in data mining, but they have shortcomings with respect to information loss and data utility. This research work mainly focuses on using perturbation techniques to preserve privacy, increase data accuracy and decrease information loss.
First, a data stream is generated by MOA, or a dataset is taken from the UCI data repository. The goal is to transform a given dataset S into a modified version S′ that satisfies a given privacy requirement and preserves as much information as possible for the intended data analysis task. The existing system combines the values of multiple numeric columns but provides privacy to only one column. The proposed Sliding Window method removes this drawback of the existing system and provides privacy to an individual column using only that column's values.
The classification method (Hoeffding Tree) for data stream mining is then applied to the perturbed dataset S′. It generates a classification model, and both results are compared with respect to various evaluation parameters. In this way the proposed method increases the accuracy of data stream classification while also providing privacy to the data stream.
Figure 3: Framework for privacy preserving in data stream classification
See the following table, in which there are 3 numeric attributes (Age, Salary and Bonus) and 2 non-numeric attributes (Name and Gender).
Table 1: Original Dataset
Name Age Gender Salary Bonus
James 25 M 25000 4563.45
Bob 22 M 34000 2314.34
Alice 24 F 23400 3498.56
Prince 28 M 34500 4467.00
There are 3 numeric attributes. Suppose the selected attribute is Salary.
First, for the first row:
• Add all the numeric values: 25 + 25000 + 4563.45 = 29588.45
• Mean = 29588.45 / 3 = 9862.81
• Replace the Salary value 25000 by the mean value, 9862.81
Then, for the second row:
• Add all the numeric values: 22 + 34000 + 2314.34 = 36336.34
• Mean = 36336.34 / 3 = 12112.11
• Replace the Salary value 34000 by the mean value, 12112.11
After completing the calculation for all rows, the output dataset will be as follows:
Table 2: Output of Existing System
Name Age Gender Salary Bonus
James 25 M 9862.81 4563.45
Bob 22 M 12112.11 2314.34
Alice 24 F 8974.18 3498.56
Prince 28 M 12998.33 4467.00
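The existing system's row-wise computation above can be reproduced in a short Python sketch; the truncation to two decimals (rather than rounding) is inferred from the values in Table 2:

```python
import math

rows = [
    # (Name, Age, Gender, Salary, Bonus) from Table 1
    ("James",  25, "M", 25000.0, 4563.45),
    ("Bob",    22, "M", 34000.0, 2314.34),
    ("Alice",  24, "F", 23400.0, 3498.56),
    ("Prince", 28, "M", 34500.0, 4467.00),
]

def truncate2(x):
    # Table 2 appears to truncate (not round) to two decimals
    return math.floor(x * 100) / 100

perturbed = []
for name, age, gender, salary, bonus in rows:
    mean = (age + salary + bonus) / 3  # mean of the 3 numeric attributes in the row
    perturbed.append((name, age, gender, truncate2(mean), bonus))

print([r[3] for r in perturbed])  # → [9862.81, 12112.11, 8974.18, 12998.33]
```

Note how each replacement mixes Salary with Age and Bonus, which is the multi-column dependence the proposed method avoids.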
Now privacy is provided using the proposed approach. Again there are 3 numeric attributes; suppose the selected attribute is Salary.
First, for the first row:
• The window size is 3.
• Add the Salary values in the window: 25000 + 34000 + 23400 = 82400
• Mean = 82400 / 3 = 27466.66
• Replace the Salary value 25000 by the mean value, 27466.66
Then, for the second row:
• Add the Salary values in the window: 34000 + 23400 + 34500 = 91900
• Mean = 91900 / 3 = 30633.33
• Replace the Salary value 34000 by the mean value, 30633.33
After completing the calculation for all rows, the output dataset will be as follows:
Table 3: Output of Proposed System
Name Age Gender Salary Bonus
James 25 M 27466.66 4563.45
Bob 22 M 30633.33 2314.34
Alice 24 F 27633.33 3498.56
Prince 28 M 31166.66 4467.00
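The proposed sliding-window computation can likewise be sketched in Python. The wrap-around of the window to the top of the Salary column for the last rows is inferred from the values in Table 3, as is the truncation to two decimals:

```python
import math

salaries = [25000.0, 34000.0, 23400.0, 34500.0]  # original Salary column from Table 1
WINDOW = 3

def truncate2(x):
    # Table 3 appears to truncate to two decimals
    return math.floor(x * 100) / 100

n = len(salaries)
perturbed = []
for i in range(n):
    # window of WINDOW original Salary values starting at row i; to reproduce
    # Table 3, the window wraps back to the top of the column near the end
    window = [salaries[(i + j) % n] for j in range(WINDOW)]
    perturbed.append(truncate2(sum(window) / WINDOW))

print(perturbed)  # → [27466.66, 30633.33, 27633.33, 31166.66]
```

Only the Salary column participates, so the perturbed values stay on the same scale as the originals, which is why the classifier trained on S′ loses less accuracy than with the existing method.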
Classification is then performed on this perturbed dataset using a Hoeffding Tree to classify the data. Experiments were also conducted on the Bank Marketing Dataset, taken from the UCI dataset repository; it relates to the direct marketing campaigns of a Portuguese banking institution and contains 45211 instances and 17 attributes.
4. CONCLUSION
The main objective of privacy preserving data mining is to develop algorithms that hide or provide privacy to certain sensitive information so that it cannot be disclosed to unauthorized parties or intruders. Based on the experiments, we can say that the existing system provides privacy to the data stream but does not preserve the accuracy of the dataset; the proposed approach removes this drawback, providing privacy to the data stream while also increasing accuracy. Here privacy is provided only to numeric values; in future work this can be extended to non-numeric values as well.
ACKNOWLEDGMENT
I am very obliged to Dr. N. D. Shah, Principal of Parul Institute of Engineering and Technology, for providing the facilities to achieve the desired milestone. I also extend my thanks to the Head of Department, Dr. Gokulnath K, for his inspiration and continuous support. I wish to warmly thank my guide, Ms. Tejal Patel, for all her diligence, guidance, encouragement, inspiration and motivation throughout; without her treasured advice and assistance it would not have been possible for me to attain this landmark. I take this opportunity to thank all my friends for their support and help in each and every aspect.
REFERENCES
Kiran Patel, Hitesh Patel, Parin Patel, "Privacy Preserving in Data Stream Classification using Different Proposed Perturbation Methods", IJEDR, Volume 2, Issue 2, 2014, ISSN: 2321-9939.
Manish Shannal, Atul Chaudhar, Manish Mathuria, Shalini Chaudhar, Santosh Kumar, "An Efficient Approach for Privacy Preserving in Data Mining", International Conference on Signal Propagation and Computer Technology, IEEE 2014, 244-249.
Radhika Kotecha, Sanjay Garg, "Data Streams and Privacy: Two Emerging Issues in Data Classification", 5th Nirma University International Conference on Engineering (NUiCONE), IEEE 2015.
Rupinder Kaur, Meenakshi Bansal, "Transformation Approach for Boolean Attributes in Privacy Preserving Data Mining", 1st International Conference on Next Generation Computing Technologies, IEEE 2015, 644-648.
Neha Pathak, Shweta Pandey, "An Efficient Method for Privacy Preserving Data Mining in Secure Multiparty Computation", Nirma University International Conference on Engineering, IEEE 2013, 1-3.
Dhanalakshmi M., Siva Sankari E., "Privacy Preserving Data Mining Techniques - Survey", ICICES, IEEE 2014, ISBN: 978-1-4799-3834-6/14.
Hina Vaghashia, Amit Ganatra, "A Survey: Privacy Preserving Techniques in Data Mining", International Journal of Computer Applications (0975-8887), Volume 119, No. 4, June 2015.
C. Clifton, M. Kantarcioglu, J. Vaidya, "Defining Privacy for Data Mining", Next Generation Data Mining, AAAI/MIT Press, 2004.
L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 571-588, 2002.
Gayatri Nayak, Swagatika Devi, "A Survey on Privacy Preserving Data Mining: Approaches and Techniques", International Journal of Engineering Science and Technology, Vol. 3, No. 3, 2127-2133, 2011.
Charu C. Aggarwal, Philip S. Yu, "Privacy-Preserving Data Mining: Models and Algorithms", Advances in Database Systems, Springer Science+Business Media, LLC, 2008.
Data Mining Tutorial, http://www.tutorialspoint.com/data_mining/data_mining_tutorial.pdf
Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd Edn., Morgan Kaufmann Publishers (an imprint of Elsevier), Waltham, MA, USA.