Autosync: Protecting Sensitive Data Through Masking

By Patrick Vernon, technology writer.

Protecting sensitive data is a major concern for most organizations. As part of business as usual, organizations routinely collect, process, store, and transmit data about their customers. Some of this data is necessary for the core business, while other data may be part of the profit model (e.g. collecting and selling data for targeted advertising).


In recent years, data privacy regulations have become increasingly common and far-reaching. The EU’s General Data Protection Regulation (GDPR) and many similar laws have broadly defined what counts as “sensitive data” and increased the penalties for failing to properly secure it. One of the biggest challenges of protecting sensitive data is finding all of it. With the advent of cloud storage and data processing, sensitive data is increasingly stored in locations outside the organization’s direct control.

 

While an organization can institute a strict data encryption strategy that never allows sensitive data to leave the organization’s network unencrypted, this can impact usability and the ability to perform key business duties. Solutions like data masking can help bridge the gap between usability and security.

 

Lost in the Shuffle

 

One of the easiest ways to “lose” sensitive data is to include it in a bulk upload or place it in a cloud storage location where the user doesn’t understand how to properly configure the security settings. While this data is nominally still under the organization’s “control”, that doesn’t mean other people can’t access it too.

 

GitHub Breaches

 

A prime example of the dangers of bulk uploads is the amount of data leaked on the code-hosting platform GitHub. GitHub acts as a hosted version-control service for developers, allowing them to track the history of their code, push and pull changes, and so on.

 

The issue is that this source code is designed to do something, and doing it often requires access to other valuable data or services. Over 100,000 GitHub repos have been found to leak API or cryptographic keys (for a total of over 200,000 unique leaked keys). These keys can be used to gain access to the owner’s account on websites like Amazon, Facebook, Google, and Twitter, or to remotely authenticate to a user’s computer (in the case of RSA cryptographic keys). The keys were uploaded unintentionally, but they can completely compromise their owners’ accounts.
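To make the problem concrete, the sketch below shows the kind of pattern matching a secret scanner applies to a checked-out repo. The AWS-style key regex and the simple file walk are illustrative assumptions; real scanners combine many such patterns with entropy checks and provider-specific validation.

    import os
    import re

    # Illustrative pattern: AWS access key IDs are 20 characters starting with "AKIA".
    AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

    def scan_tree(root="."):
        """Walk a checkout and report files containing strings shaped like AWS key IDs."""
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames[:] = [d for d in dirnames if d != ".git"]  # skip git metadata
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", errors="ignore") as handle:
                        for lineno, line in enumerate(handle, start=1):
                            if AWS_KEY_RE.search(line):
                                print(f"possible leaked key: {path}:{lineno}")
                except OSError:
                    continue  # unreadable file; skip it

    if __name__ == "__main__":
        scan_tree()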

 

Leaked on the Cloud

 

The cloud can also be a risky place to store data. Repositories like Amazon S3 buckets provide convenient, cheap storage; however, many people don’t know how to properly configure their security settings. At its simplest, cloud storage like S3 has two access modes: private and public. In private mode, access to the stored data has to be explicitly granted. In public mode, anyone with the bucket’s URL can access it.
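As a concrete illustration of keeping a bucket locked down, the sketch below uses the boto3 S3 API to turn on all of S3’s “Block Public Access” settings. The bucket name is a placeholder, and the code assumes AWS credentials are already configured.

    import boto3

    def lock_down_bucket(bucket_name: str) -> None:
        """Enable all four S3 Block Public Access settings on a bucket."""
        s3 = boto3.client("s3")
        s3.put_public_access_block(
            Bucket=bucket_name,
            PublicAccessBlockConfiguration={
                "BlockPublicAcls": True,        # reject requests that attach public ACLs
                "IgnorePublicAcls": True,       # ignore any public ACLs already present
                "BlockPublicPolicy": True,      # reject bucket policies granting public access
                "RestrictPublicBuckets": True,  # limit access to authorized principals only
            },
        )

    if __name__ == "__main__":
        lock_down_bucket("example-crown-jewels-bucket")  # hypothetical bucket name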

 

It might seem that no one would ever stumble across your company’s S3 bucket URL, but this kind of “security through obscurity” rarely holds up. Tools exist that are designed specifically to scan for Amazon S3 bucket URLs, and they routinely find them.

 

Many S3-related sensitive data breaches have occurred, and they aren’t limited to organizations that “wouldn’t know any better”. The US Pentagon has suffered several data breaches, including one that exposed dozens of terabytes of intelligence data collected with open-source techniques such as trawling social media sites.

 

Wearing a Mask

 

Many organizations have security staff capable of setting up data storage securely, whether that means properly configuring Amazon S3 permissions or setting up a GitHub repo so that it never receives sensitive information. The real risk lies in the buckets and repos that aren’t set up by those people, and they almost certainly exist.

 

When dealing with cases like these, there is a fine line to walk between security and usability. Some organizations simply block all cloud storage sites to prevent employees from placing sensitive data on them. However, there are legitimate business uses for these sites, and inventive employees will always find a way around an annoying restriction.

 

Implementing a data masking solution is a good way to strike a balance between security and usability. These solutions are designed to sit between a repository of sensitive data (such as your organization’s crown-jewel database) and untrusted areas (the cloud, the Internet, etc.). The solution scans for data matching certain patterns (email addresses, phone numbers, ID numbers, etc.) and replaces any detected instances with realistic but fake values.
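As a rough sketch of that pattern-matching step, the snippet below masks email addresses and US-style phone numbers in a block of text with realistic-looking placeholders. The regexes and replacement scheme are illustrative assumptions, not how any particular masking product works.

    import random
    import re

    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
    PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

    def fake_email(_match: re.Match) -> str:
        """Substitute a realistic but fake address."""
        return f"user{random.randint(1000, 9999)}@example.com"

    def fake_phone(_match: re.Match) -> str:
        """Substitute a reserved 555 phone number."""
        return f"{random.randint(200, 999)}-555-{random.randint(0, 9999):04d}"

    def mask(text: str) -> str:
        """Scan text for sensitive patterns and replace each hit with a fake value."""
        text = EMAIL_RE.sub(fake_email, text)
        text = PHONE_RE.sub(fake_phone, text)
        return text

    if __name__ == "__main__":
        record = "Contact alice.smith@acme.example on 415-867-5309 about the renewal."
        print(mask(record))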

 

With data masking, an organization can whitelist approved applications to receive unmodified data, while sending sanitized data to any unauthorized or insecure repositories. If the user doesn’t need real data (e.g. they’re using it to test software), then no harm is done. If they do, they can submit a formal request for unsanitized data, giving your security team a chance to vet the application first. Either way, your sensitive data is protected the entire time.
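At its core, that gateway is a simple decision: is the destination on the approved list? The sketch below illustrates the idea; the allowlist entries and function names are hypothetical, and the mask() helper is a trimmed-down stand-in for the masking step sketched above.

    import re

    # Hypothetical allowlist of applications approved to receive unmodified data.
    APPROVED_DESTINATIONS = {"billing-service", "fraud-analytics"}

    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def mask(record: str) -> str:
        """Minimal stand-in for the masking step: redact email addresses."""
        return EMAIL_RE.sub("user0000@example.com", record)

    def export_record(destination: str, record: str) -> str:
        """Return real data only to approved destinations; mask it for everyone else."""
        if destination in APPROVED_DESTINATIONS:
            return record      # vetted application, receives unmodified data
        return mask(record)    # unapproved destination, receives sanitized data

    if __name__ == "__main__":
        # A test environment is not on the allowlist, so it gets masked data.
        print(export_record("staging-test-env", "Reach alice.smith@acme.example about the renewal."))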

 

Secure by Default

 

Many applications are designed to be “secure by default”; even Amazon S3 buckets default to private mode. However, users often disable security precautions without understanding the associated risks, opening the organization up to damaging and expensive data breaches. Implementing a data masking solution helps keep an organization “secure by default”, since it invisibly protects your data without impacting any application with a valid need for it. And because these controls are under your organization’s control rather than your users’, they’re far less likely to be disabled when they become “inconvenient”.

 
