Businesses and organizations now hold more personal information than ever before. Storing large volumes of data is useful in a variety of ways, such as reporting and analytics, but those same activities can expose the PII linked to the data being analyzed. Encryption protects data while it is transmitted or stored, whereas anonymization protects data while it is being used or released. Anonymization tends to suit smaller datasets, while encryption scales better to larger ones. For compliance purposes, a practical approach is to combine encryption for data at rest and in transit with anonymization for data in use.

The goal of anonymizing data is not merely to obfuscate it; it's critical that the data cannot be re-identified after anonymization. That means weighing factors such as the volume of data, the kind of information it contains, and the risk of identification while anonymizing. For most businesses, the goal will be to anonymize sensitive data such as PII/PHI. In an earlier post, I spoke about using k-anonymity to protect sensitive data. However, research has questioned the effectiveness of k-anonymity, particularly for large datasets. I followed up with another post about l-diversity, a technique that addresses some of the concerns around k-anonymity. This post discusses t-closeness, which deals with some of the concerns around l-diversity.

Potential issues with l-diversity

Talk to a researcher and they will confirm that l-diversity is more rigorous than k-anonymity. There are very few implementations of the algorithm in actual use, though. A lot of the utility of the dataset is lost, perhaps too much just to fix two issues with k-anonymity. And you pick up two more attack vectors: skewness and similarity.

A similarity attack exploits the fact that a group's sensitive values can be l-diverse yet semantically similar. If a group contains three distinct diagnoses, lung cancer, liver cancer, and stomach cancer, it satisfies 3-diversity, but an intruder can still conclude that the subject has cancer. In a skewness attack, there may be a grouping where half the patients have heart disease and the other half do not. If an intruder identifies the target as belonging to that group, they can infer a 50% chance of a heart condition, which is a far higher probability than the population average.
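To make the two attacks concrete, here is a minimal Python sketch. All of the data, including the 2% background rate, is fabricated for illustration:

```python
from collections import Counter

# A 3-diverse equivalence class: three distinct diagnoses, but an
# intruder who places a target here still learns "has cancer"
# (similarity attack), because the values are semantically alike.
similar_group = ["lung cancer", "liver cancer", "stomach cancer"]
print(len(set(similar_group)) >= 3)               # satisfies 3-diversity
print(all("cancer" in d for d in similar_group))  # ...yet leaks the category

# Skewness attack: a 2-diverse group that is 50% heart disease,
# even though the condition is rare in the full population.
skewed_group = ["heart disease"] * 5 + ["healthy"] * 5
counts = Counter(skewed_group)
p_in_group = counts["heart disease"] / len(skewed_group)
p_global = 0.02  # assumed background rate, for illustration only
print(f"Inferred risk jumps from {p_global:.0%} to {p_in_group:.0%}")
```

Both groups pass the l-diversity test, which is exactly the point: diversity of values alone says nothing about what those values have in common or how far the group's distribution drifts from the population's.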

There is a third option, t-closeness, which addresses both the skewness and similarity attacks.

What is t-closeness?

The t-closeness model is a refinement of l-diversity. The l-diversity model handles all values of a given attribute in the same manner, regardless of their relative frequency in the data. Real data sets rarely look like that; attribute values are often distributed very unevenly. An adversary can use background knowledge of the global distribution to make inferences about sensitive values in the data, and the uneven distribution can also make it harder to construct feasible l-diverse representations. This is the skewness vulnerability of l-diversity.

Not all attribute values are equally important. For example, a disease-related attribute may be more sensitive when the value is positive rather than negative.

The t-closeness model was proposed to enforce the property that the distance between the distribution of the sensitive attribute within an anonymized group and its global distribution across the whole table is no more than a threshold t. The original paper measures this distance with the Earth Mover's Distance (EMD).
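Here is a minimal sketch of that check, assuming a categorical sensitive attribute. The function name and toy data are mine; I use total variation distance, which is what EMD reduces to when all categories are treated as equally distant from one another:

```python
from collections import Counter

def t_closeness_distance(group_values, global_values):
    """Distance between a group's sensitive-value distribution and the
    global distribution. For a categorical attribute with an
    equal-distance ground metric, EMD reduces to total variation
    distance: 0.5 * sum(|p_i - q_i|)."""
    group_n, global_n = len(group_values), len(global_values)
    group_freq = Counter(group_values)
    global_freq = Counter(global_values)
    categories = set(group_freq) | set(global_freq)
    return 0.5 * sum(
        abs(group_freq[c] / group_n - global_freq[c] / global_n)
        for c in categories
    )

# A release satisfies t-closeness if every equivalence class stays
# within the threshold t of the global distribution.
global_diagnoses = ["flu"] * 90 + ["heart disease"] * 10
skewed_group = ["flu"] * 5 + ["heart disease"] * 5

d = t_closeness_distance(skewed_group, global_diagnoses)
print(f"distance = {d:.2f}")  # 0.40 -- fails a threshold of, say, t = 0.15
```

The skewed group from the earlier attack example fails here, which is what we want: a group cannot concentrate a rare condition far beyond its population rate without exceeding t.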

Some studies have shown that t-closeness also tends to be more effective than many other privacy-preserving data mining methods for numeric attributes.
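For numeric attributes, EMD takes the ordering of values into account, so a group of all-low salaries scores as farther from the global distribution than a mixed group would. A sketch using SciPy's wasserstein_distance, with fabricated salary figures and a simple range normalization of my own choosing:

```python
from scipy.stats import wasserstein_distance

# Fabricated salaries for illustration.
global_salaries = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000]
group_salaries = [3000, 4000, 5000]  # a low-salary equivalence class

# Normalize by the global range so the result is comparable to t.
span = max(global_salaries) - min(global_salaries)
d = wasserstein_distance(group_salaries, global_salaries) / span
print(f"normalized EMD = {d:.3f}")  # 0.375
```

An intruder who learns a target is in the low-salary group learns a lot; the large distance reflects that, even though the three values are distinct and would satisfy 3-diversity.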

Summary of k-anonymity, l-diversity and t-closeness

Data privacy is an important consideration for any business, and it can be difficult to balance protecting sensitive data against other needs like ease of use. Take the time to do a thorough inventory of all your company's sources of customer information and identify what should be anonymized or encrypted. Once you know this, choosing among k-anonymity, l-diversity and t-closeness becomes much easier.

For example, most organizations will want to start with k-anonymity where k is around 10, which seems to satisfy most business use cases. Once that baseline is working, revisit the practice and see whether there is a need to add t-closeness.
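The baseline check itself is small. A sketch assuming pandas and hypothetical quasi-identifier columns ("zip", "age_band", "gender"):

```python
import pandas as pd

def satisfies_k_anonymity(df, quasi_identifiers, k=10):
    """True if every combination of quasi-identifier values
    appears in at least k rows."""
    return df.groupby(quasi_identifiers).size().min() >= k

# Usage with a fabricated frame of 12 identical quasi-identifier rows:
df = pd.DataFrame({
    "zip": ["021**"] * 12,
    "age_band": ["30-39"] * 12,
    "gender": ["*"] * 12,
    "diagnosis": ["flu"] * 6 + ["heart disease"] * 6,
})
print(satisfies_k_anonymity(df, ["zip", "age_band", "gender"]))  # True
```

Note that this frame passes the k = 10 test while still being vulnerable to the skewness attack described earlier, which is exactly why the t-closeness follow-up is worth revisiting.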

Practical Implementation

These algorithms only deliver value once they are implemented. Most of the research focuses on privacy concerns for specific datasets released to the public; our concern is reducing the potential threat of non-malicious insider activity by protecting internal datasets. In the next post, I'll identify some practical implementations.