Data is more important than ever. Across every industry, organizations have the opportunity to uncover unprecedented insights from the datasets they hold, and many technological advances have come from information found in data. Access to data is also essential for innovations such as machine learning, which are becoming increasingly prevalent.
Although the use of data brings many benefits, there is growing concern about the privacy of citizens around the world. The last decade has seen a breathtaking series of data scandals. These include Cambridge Analytica collecting and exploiting Facebook data, the breach of Marriott International in which the passport numbers of 25 million individuals were stolen, and NYC’s Taxi and Limousine Commission releasing records of 173 million trips made by New York taxis without realizing that passengers were easily re-identifiable. A major consequence of a data breach is the loss of credibility with customers, which can hurt a business tremendously. These scandals highlight the need to protect data, whether it is being published for reuse or simply stored.
Additionally, due to the recent enforcement of the European Union’s General Data Protection Regulation (GDPR), many businesses now need data protection. “The GDPR not only applies to all organizations located within the EU but also applies to organizations outside the EU if they offer goods or services to, or monitor the behavior of, EU data subjects.” The latter part of the previous statement is the reason there is now a global market for data protection. Companies that process or store large amounts of data must comply with the GDPR to avoid regulatory sanctions and potential litigation.
Another regulation, the US federal Health Insurance Portability and Accountability Act of 1996 (HIPAA), is also a driving factor for anonymization. The HIPAA Privacy Rule provides national standards to protect the medical records and other protected health information (PHI) of individuals. More specifically, it names 18 specific identifiers that must be protected. Healthcare organizations that wish to share PHI must act in accordance with the HIPAA Privacy Rule.
The two main methods of protecting data are pseudonymization and anonymization. The table below provides basic examples of pseudonymization and anonymization.
Pseudonymization is the process of replacing personally identifiable information (PII) with artificial identifiers, or pseudonyms, in a way that can later be reversed to recover the original data. Article 4(5) of the GDPR defines pseudonymization as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.” Protegrity’s patented vault-less tokenization is an example of pseudonymization: to access the original information, you would need to retrieve the heavily protected virtual key that was used to scramble it.
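To make the definition concrete, the following is a minimal Python sketch of pseudonymization using a simple lookup table. It is an illustration only and is not Protegrity’s vault-less tokenization. The class name Pseudonymizer, the "tok_" prefix, and the sample records are all hypothetical. The lookup table plays the role of the “additional information” that must be kept separately and protected.

```python
import secrets


class Pseudonymizer:
    """Replaces identifiers with random tokens and remembers the mapping.

    Hypothetical helper for illustration. The two dictionaries are the
    'additional information' that would have to be stored separately
    from the pseudonymized data and protected.
    """

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def pseudonymize(self, value: str) -> str:
        # Reuse the existing token so the same person always maps to
        # the same pseudonym and records can still be linked.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def reidentify(self, token: str) -> str:
        # Only possible for whoever holds the mapping.
        return self._value_by_token[token]


# Invented sample records.
records = [
    {"name": "Alice Example", "diagnosis": "asthma"},
    {"name": "Bob Example", "diagnosis": "diabetes"},
    {"name": "Alice Example", "diagnosis": "flu"},
]

vault = Pseudonymizer()
pseudonymized = [{**r, "name": vault.pseudonymize(r["name"])} for r in records]
```

In this sketch the pseudonymized records can be analyzed without exposing names, while anyone holding the mapping can call reidentify to recover them. That reversibility is what separates pseudonymization from anonymization.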
Data anonymization is the process of removing or modifying personally identifiable data in data sets so that the individuals described in the data remain anonymous. The Article 29 Working Party, in its opinion on anonymization techniques, says that in order to achieve anonymization, “data must be processed in such a way that it can no longer be used to identify a natural person by using ‘all the means likely reasonably to be used’ by either the controller or the third party.” Unlike pseudonymization, anonymization is irreversible. At face value this may seem as simple as removing personal identifiers such as social security numbers or the names of individuals, as in the table above, but on closer inspection it becomes clear that this does not suffice. A quasi-identifier is a piece of information that is not itself a personal identifier but that can form one when combined with other pieces of information. In a 2002 paper, Latanya Sweeney showed that around 87 percent of the US population can be uniquely identified with just three quasi-identifiers: 5-digit zip code, gender, and date of birth.
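The effect of quasi-identifiers can be illustrated with a short Python sketch that measures what share of records is unique on a chosen combination of fields. The function name unique_fraction, the field names, and the four sample rows are invented for illustration and are not taken from Sweeney’s data.

```python
from collections import Counter

# Hypothetical field names for the three quasi-identifiers discussed above.
QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the share of rows whose combination of `keys` appears only once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


# Invented records with names already removed.
rows = [
    {"zip": "10001", "gender": "F", "dob": "1980-02-14", "diagnosis": "asthma"},
    {"zip": "10001", "gender": "F", "dob": "1980-02-14", "diagnosis": "flu"},
    {"zip": "10001", "gender": "M", "dob": "1975-07-01", "diagnosis": "diabetes"},
    {"zip": "10002", "gender": "F", "dob": "1991-11-30", "diagnosis": "migraine"},
]

fraction = unique_fraction(rows)                         # all three fields
gender_only = unique_fraction(rows, keys=("gender",))    # a single field
```

Here two of the four records are unique on the full combination, while only one is unique on gender alone. Even with names removed, a unique record can be matched against any outside source that lists the same three fields.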
What makes anonymization the more appealing option for many companies is that anonymous data is not considered personal data for the purposes of the GDPR. This is because pseudonymous data can still be re-identified in some form, while anonymous data cannot. Recital 26 of the GDPR states, “The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” This means that companies do not need consent to process anonymous data, can store it indefinitely, can export it internationally, and can use it for purposes other than those for which it was collected, including selling it. Being able to share data externally without putting the privacy of individuals at risk is a major advantage of anonymous data.
Similarly, once protected health information has been de-identified, the HIPAA Privacy Rule no longer applies, because the rule covers only identifiable information. If PHI is de-identified so that the identity of individuals can no longer be determined, it can be freely shared. This is extremely important in the healthcare industry, as it allows PHI to be used in academic studies.
Anonymization gives companies far more freedom in how they use their data, allowing them to use it for essentially any purpose they see fit. With the regulations of HIPAA and the recent enforcement of the GDPR, it is clear that the time for anonymization is now.
2. SWOT Analysis
The market for data anonymization looks extremely promising, but it is important to take into account all the factors that may affect it. These factors are analyzed below.
- The market for data anonymization is large and growing rapidly
- Demand for a Protegrity anonymization product will be high and its sale will be aided by our already established network
- Could potentially be an add-on to existing software
- Anonymization is a double-edged sword; there is a tradeoff between the risk and the utility of the data
- Creating anonymization software will be difficult
- There are many different anonymization methods which will need to be considered when creating the software
- Companies face limiting factors in implementation of anonymization
- Companies from various industries are collecting and processing data, so the need for anonymization is widespread
- Anonymization software will allow Protegrity to satisfy the needs of more customers and expand its data protection capabilities
- There are already established companies in the market space that Protegrity will have to compete with
3. Potential Use Cases
Outlined below are a few potential use cases for data anonymization. They are only a few of the many that exist, but they show the wide range of uses that data anonymization has.
3.1. Big Data Anonymization
Big data analytics has made its way to the forefront of today’s digital age and has revolutionized the way companies are able to operate. Although big data has been around for many years, companies are only now beginning to realize that by applying analytics to all the data that streams into their business, they can extract significant value from it. Big data analytics helps companies recognize new opportunities that they would otherwise have missed. It is particularly helpful for reducing costs, making better decisions, and creating new products and services. Many industries have begun adopting big data, as can be seen in the figure below. This figure, from 2017, shows the percentage of companies in each industry using big data.
The following figure, also from 2017, shows that forty-one percent of respondents said they currently use big data, which is a greater than two-fold increase from 2015. Additionally, forty-six percent of respondents said they may use big data in the future.
Companies are eager to use the large amounts of data they are collecting, but at the same time they want to comply with the GDPR in order to avoid data protection violations, for which the fines have increased significantly. This data can be useful in anonymous form, as the information can still be used without knowing to whom it refers.
3.2. Electronic Health Records
In the healthcare industry, electronic health records (EHRs) provide numerous clinical advantages. EHRs contain a significant amount of information, including demographics, clinical data, and pharmacy and billing claims, among other patient information, which the industry can use to change the way it conducts clinical research and provides healthcare in the future. By using EHRs, it is possible to expedite the recruitment of patients for clinical trials, optimize patient safety, and streamline data capture, thus saving both time and money. The benefits of merging EHR data with clinical trials are numerous.
However, publishing raw EHRs may be considered a breach of privacy, as they often contain sensitive information about individuals. Therefore, to preserve privacy, it is common practice to anonymize the data before publishing, using privacy models such as k-anonymity.
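As a rough illustration of k-anonymity, a data set is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The Python sketch below generalizes two quasi-identifiers and then measures k. The generalization rules (keep three zip digits, use 10-year age bands), the function names, and the sample records are assumptions chosen for illustration, not a prescribed method.

```python
from collections import Counter


def generalize(row):
    """Coarsen quasi-identifiers: keep 3 zip digits, use a 10-year age band."""
    decade = (row["age"] // 10) * 10
    return {
        "zip": row["zip"][:3] + "**",
        "age": f"{decade}-{decade + 9}",
        "diagnosis": row["diagnosis"],  # sensitive value is left unchanged
    }


def k_of(rows, quasi_identifiers=("zip", "age")):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


# Invented health records with names already removed.
raw = [
    {"zip": "10001", "age": 34, "diagnosis": "asthma"},
    {"zip": "10002", "age": 37, "diagnosis": "flu"},
    {"zip": "10011", "age": 31, "diagnosis": "diabetes"},
    {"zip": "10458", "age": 52, "diagnosis": "migraine"},
    {"zip": "10467", "age": 58, "diagnosis": "asthma"},
]

released = [generalize(r) for r in raw]
k_raw = k_of(raw)            # every raw record is unique, so k is 1
k_released = k_of(released)  # smallest generalized group has 2 records
```

The generalized data is 2-anonymous, at the cost of less precise zip codes and ages. Coarser generalization raises k further but reduces the analytical value of the data.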
The following figure was obtained from a survey of 100 healthcare professionals in the U.S. and Canada conducted by Privacy Analytics, a major player in the area of data anonymization. It shows which types of data are anonymized in the healthcare industry, with EHRs being the leading source. The other forms of data in the figure represent further use cases of anonymization in the healthcare industry.