Pseudonymization vs. Data Masking

By: Finn Turner

You can probably be identified, even from anonymized data. It’s not ideal, but depending on your organization, it might not be disastrous. Let’s explore what can happen if you reveal too much data, as well as what “too much” data can mean in different contexts. We’ll also explore how data masking can be combined with access controls to reduce the risk of data being leaked or exposed, either within the organization itself or to external entities.

As data leaks become more common, it’s crucial that customer data isn’t stored in a way that lets it be obtained directly or recreated from records. Anonymization (or pseudonymization) and encryption are typically put forward as the primary solutions to these problems; however, neither is a perfect solution.

The Problem With Encryption and Anonymization

Encryption is useful if someone physically grabs your storage device and walks away with it. Encryption can also be used to ensure communications between two computers are secure enough to transmit data and keep it safe from prying eyes. What encryption can’t do is prevent an individual or program that’s obtained authorized access to your data from taking the data and doing something with it.

Anonymization is useful if you want to collect data about a group of people, but don’t want the individuals in that group to be vulnerable should the dataset be compromised. Anonymization clearly can’t be used in all contexts, but even where it’s reasonable to deploy it, anonymization isn’t perfect.

A study published in Nature Communications is an excellent example of why anonymization isn’t the panacea so many wish it were. The study indicates that, even in an anonymized dataset, it’s usually easy to identify specific people. It also challenges the oft-held belief that releasing only a subset of records, whether to the public or to a group of private researchers, increases overall privacy.

According to the study, the researchers stated that with 15 pieces of demographic information they’d be able to identify 99.98% of Americans, even in an anonymized dataset. That doesn’t put anonymization in a good light; even releasing a 1% subset of the data could lead to identification and targeting.
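To get a feel for why a handful of demographic attributes can single people out, here is a minimal Python sketch. It is not the method used in the study; it simply counts how many records in a small, invented dataset become unique as more attributes are combined. The records, the column names, and the unique_fraction helper are all assumptions made up for illustration.

```python
from collections import Counter

# Toy records invented for illustration; names have already been removed.
records = [
    {"zip": "02139", "birth_year": 1984, "gender": "F"},
    {"zip": "02139", "birth_year": 1984, "gender": "M"},
    {"zip": "02139", "birth_year": 1991, "gender": "F"},
    {"zip": "73301", "birth_year": 1984, "gender": "F"},
    {"zip": "73301", "birth_year": 1984, "gender": "F"},
    {"zip": "73301", "birth_year": 1975, "gender": "M"},
]


def unique_fraction(rows, attributes):
    """Fraction of rows whose combination of the given attributes is unique."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each attribute on its own singles out nobody in this toy data,
# but combining them makes most records unique.
for attrs in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "gender"]):
    print(attrs, round(unique_fraction(records, attrs), 2))
```

In this toy data, no single attribute identifies anyone, yet combining three of them leaves four of the six records unique. A unique record is one that an attacker holding the same attributes from another source could link back to a person.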

That’s not to say anonymization doesn’t help; it simply indicates it’s not perfect. No security technology ever is.

What this finding does is reinforce an organization’s need to deploy defense in depth, using multiple layers of technologies to provide a higher level of security for its data. Technologies need to be combined with business practices to ensure sensitive data is exposed only to those individuals or systems that absolutely must have access to it, and only for the periods of time during which they absolutely must have access.

In addition, organizations need to undertake efforts to ensure that, even if data is exposed, it’s stored in such a way as to make it as difficult as reasonably possible for an attacker to identify and target real people. Anonymization can’t accomplish this alone; additional tools are required. One technology that can help with this is data masking.

Data Masking

In the real world, people have to work with real data to get things done. Developers need to test their applications against real data to make sure those applications do what they’re supposed to do.

Various people within any organization need to be able to pull up user and customer records to engage in any number of tasks. But not all data needs to be visible to all of these people at all times. This is where data masking comes in.

Data masking is, in practice, filling in a column in a database table with information that is garbage, but looks real. Data masking could apply to technologies other than databases; however, it’s predominantly found as a feature of database applications.

For example: Let’s say you have a table with user information and credit card numbers in it. You want to use that database in a development environment, but you don’t want to put those credit card numbers at risk. You would use data masking to strip out the real credit card numbers and fill them instead with fake ones that look real to a developer’s code, but aren’t real. Similarly, data masking could be used in contexts where a user or application doesn’t need access to specific types of data.
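As a rough illustration of “fake ones that look real,” the Python sketch below replaces a card number with a random number of the same length that still passes the Luhn checksum, which is the check most card-handling code runs first. The function names are hypothetical, and this isn’t how any particular database product implements masking. A real masking tool would also need to deal with issuer prefixes and with the small chance that a random number matches a live card.

```python
import random


def luhn_check_digit(partial):
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number):
    """True if the last digit is the correct Luhn check digit for the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(real_number, rng):
    """Return a random number of the same length that passes the Luhn check.

    The result is not derived from the real digits, so the original
    cannot be recovered from the masked value.
    """
    body = "".join(str(rng.randrange(10)) for _ in range(len(real_number) - 1))
    return body + luhn_check_digit(body)


rng = random.Random()  # pass a seeded Random for repeatable test fixtures
fake = mask_card_number("4111111111111111", rng)
print(fake, luhn_valid(fake))
```

Because the fake value has the right length and a valid checksum, a developer’s validation code should treat it like any other card number, while the real number never leaves production.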

Let’s consider an employee working in shipping who’s accessing a database using an application that isn’t adequately designed for modern data security practices. When this application accesses a record, it must access the entire record. It doesn’t have the option of only reading partial records based on user context.

A database administrator could configure data masking on key columns the shipping employee doesn’t need to see (for example, credit cards), and do so for all accesses by that user. The database administrator could also set those columns to read-only for that user context, meaning the application they’re using can’t save the masked version of the data on top of the good data when a record is updated.

In this case the application would read in what it perceived to be a complete record, with the data formatted as it expects that data to be. Because the data is masked, however, the shipping employee would be unable to access the legitimate versions of the sensitive data they don’t have a need to access.
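The shipping scenario can be sketched in a few lines of Python. This is only a model of the idea, assuming a hypothetical policy table and helper functions; real databases expose masking and column permissions through their own configuration rather than through code like this.

```python
# Hypothetical policy: columns each role sees masked. All names are invented.
MASKED_COLUMNS = {"shipping": {"card_number"}}


def mask_value(value):
    """Replace every character with '0' so the length and format are kept."""
    return "0" * len(value)


def read_record(record, role):
    """Return a complete record, masking columns the role isn't cleared for."""
    masked = MASKED_COLUMNS.get(role, set())
    return {k: mask_value(v) if k in masked else v for k, v in record.items()}


def update_record(record, changes, role):
    """Apply changes, ignoring writes to columns that are masked for the role."""
    masked = MASKED_COLUMNS.get(role, set())
    for column, value in changes.items():
        if column not in masked:
            record[column] = value
    return record


stored = {
    "name": "A. Customer",
    "address": "1 Example St",
    "card_number": "4111111111111111",
}

# The shipping application reads the whole record, edits it, and writes it back.
view = read_record(stored, "shipping")
view["address"] = "2 Example Ave"
update_record(stored, view, "shipping")

print(view["card_number"])    # masked value seen by the shipping application
print(stored["card_number"])  # real value, untouched by the write-back
```

The application still receives and writes back a full record, but the masked column is never revealed on read and never overwritten on update.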

The security implications of this go beyond limiting employees’ exposure to data they don’t need. Data masking can, for example, be used to limit the exposure of scripts and other automated applications to sensitive data. Similarly, data masking used in this fashion limits the damage if the credentials of an employee with limited access are compromised.

Without data masking, our hypothetical shipping employee would’ve needed the rights to access all data, even data they didn’t explicitly need to see, because the application they were using required this. With the judicious use of masking, read-only settings, and a little luck regarding the design of the application, however, that employee can never access sensitive data they haven’t been explicitly cleared for. And neither can a hacker who has managed to get that shipping employee’s credentials.

Like encryption and anonymization, data masking isn’t perfect. It only applies in certain contexts, and its utility is largely dictated by the design of the applications that access the database employing that masking. That said, it’s an increasingly important tool in an organization’s information security toolbox, and both database administrators and developers should be aware it exists.
