According to analysts, IT organizations with Apache Hadoop deployments should be aware of potential security problems. In particular, using Hadoop to combine and store data from several sources makes it harder to identify and secure sensitive data. A single Hadoop deployment can hold data of multiple classifications with disparate security requirements, so the key to ensuring compliance is selecting a security solution appropriate to the Hadoop distribution in use.
In a qualitative survey of enterprise Hadoop users conducted by Dataguise, data was collected from 62 enterprise respondents at the recently held O’Reilly Strata and RSA conferences. Key findings of the survey included the following:
• 80% of the enterprises surveyed feel it is important to know whether sensitive data is stored in their Hadoop environment.
• 77% feel it is important to protect access to the sensitive data stored in their Hadoop environment.
• 33% store sensitive data in Hadoop, including Social Security numbers, credit card numbers and addresses.
• 43% of survey participants are currently testing Hadoop and 31% have active production environments.
• Data in Hadoop environments consists primarily of log files (55%), followed by structured DBMS data (36%) and mixed data types (24%).
• Company divisions using Hadoop include marketing (28%), sales (23%) and customer support (23%), with the balance spread across other divisions.
• Major challenges faced during Hadoop implementations include lack of skills (35%), Hadoop usability (23%) and security management (21%).
As petabytes of new data accumulate and propagate across businesses, much of it comes from external sources and customer interaction channels such as web sites, call centers, Facebook, and Twitter; the rest originates in traditional repositories such as RDBMSs and file servers. To mine these large volumes and varieties of data cost-efficiently, companies are adopting new technologies such as Apache Hadoop. Line-of-business managers benefit from Hadoop's ability to reveal data patterns that were previously inaccessible, but security officers are concerned about the nature of the information and its uncontrolled accessibility. They are well aware of the catastrophic financial losses and brand damage that compliance breaches can inflict on a business.
To address the challenges of Hadoop data privacy, organizations require proactive detection and protection. The ability to locate and identify sensitive data across all Hadoop clusters gives compliance experts the intelligence and assurance they need to evaluate a company's exposure and risk. A solution should then enforce the remediation policy appropriate to each type of data, such as data masking or data quarantine. Additionally, the ability to centrally manage and schedule detection and protection actions makes compliance enforcement transparent and automatic across all instances of Hadoop, on premises and in the cloud.
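To make the detection-and-masking idea concrete, the sketch below shows one minimal way such a remediation step might look in practice. It is an illustrative Python script, not Dataguise's DG for Hadoop: it reads text on stdin, detects U.S. Social Security numbers and 16-digit card numbers with regular expressions, masks them, and writes the redacted text to stdout, which lets it run as a Hadoop Streaming mapper. The regex patterns, file names, and HDFS paths are assumptions chosen for illustration, and real detection would need broader patterns plus validation (for example, a Luhn check on card numbers).

    #!/usr/bin/env python3
    """Illustrative sensitive-data masking mapper (hypothetical sketch).

    Reads text from stdin, masks SSN- and card-number-shaped strings,
    and writes the redacted text to stdout. Usable as a Hadoop
    Streaming mapper, e.g. (jar path and HDFS paths vary by install):

        hadoop jar hadoop-streaming.jar \
            -files mask_pii.py -mapper mask_pii.py \
            -input /data/raw -output /data/masked
    """
    import re
    import sys

    # Assumed formats: SSNs as NNN-NN-NNNN; card numbers as 16 digits,
    # optionally grouped by spaces or dashes.
    SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

    def mask(line: str) -> str:
        """Redact detected identifiers, keeping only a card's last four digits."""
        line = SSN_RE.sub("XXX-XX-XXXX", line)
        line = CARD_RE.sub(lambda m: "****-****-****-" + m.group(0)[-4:], line)
        return line

    if __name__ == "__main__":
        for raw in sys.stdin:
            sys.stdout.write(mask(raw))

A production tool would add the pieces this sketch omits: a wider catalog of sensitive-data patterns, a quarantine path for records that cannot be safely masked, and central scheduling of scans across clusters, which is the role the paragraph above assigns to a dedicated solution.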
“Organizations require a straightforward and economical way to determine where sensitive data is and how to effectively secure their Hadoop environments,” said Manmeet Singh, CEO, Dataguise. “The data here shows that data privacy protection is important to Hadoop users and they are actively engaging security personnel to find ways to detect and protect sensitive data to meet compliance requirements. Using solutions such as DG for Hadoop™ by Dataguise allows for proactive actions to be taken while alleviating the complexity and cost of data privacy protections.”