The following is a comprehensive summary of Chapter 9, "Governing Data Access," from the source text The Enterprise Big Data Lake.
Note: Chapter 9 spans approximately 20 pages (157–177) of the source text. This summary synthesizes every concept, example, architecture, and workflow described in the chapter into a complete study guide of the material.
Providing analysts with access to data in a data lake presents a unique set of challenges compared to traditional data storage systems. The goal is to enable the broad, exploratory analysis required for data science while strictly adhering to security protocols and government regulations.
Data lakes differ fundamentally from traditional data warehouses in ways that complicate governance: the sheer number of files and data sets, frictionless ingestion that outpaces manual review, raw data whose sensitive content is not known in advance, and exploratory use cases that cannot be predicted when permissions are set.
While the simplest access model is to grant everyone full access, this is rarely legal or feasible due to regulations regarding Personally Identifiable Information (PII), copyrighted external data, and trade secrets.
There are two primary methods for managing permissions: traditional authorization and modern self-service approaches.
Authorization involves explicitly assigning permissions (such as read or update) to specific users for specific data assets (files or tables). Typically, administrators manage this by creating roles that group permissions and assigning users to those roles. While Single Sign-On (SSO) systems help streamline logins across applications, the maintenance of traditional authorization is costly and difficult.
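As a rough illustration of this role-based model (a sketch, not code from the book), the following Python snippet groups permissions into roles, assigns users to roles, and resolves a user's effective rights; the role, user, and file names are hypothetical.

    # Permissions are grouped into roles; users are assigned to roles.
    ROLE_PERMISSIONS = {
        "hr_analyst": {("read", "/salaries.csv")},
        "hr_admin": {("read", "/salaries.csv"), ("update", "/salaries.csv")},
    }
    USER_ROLES = {
        "alice": {"hr_analyst"},
        "bob": {"hr_admin"},
    }

    def is_allowed(user: str, action: str, asset: str) -> bool:
        """Return True if any of the user's roles grants (action, asset)."""
        return any(
            (action, asset) in ROLE_PERMISSIONS.get(role, set())
            for role in USER_ROLES.get(user, set())
        )

    print(is_allowed("alice", "update", "/salaries.csv"))  # False
    print(is_allowed("bob", "update", "/salaries.csv"))    # True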
Challenges with Traditional Authorization: permissions must be granted before the actual need for data is known, maintaining roles and grants across a large and growing population of users and data assets is expensive, and access is rarely revoked once granted, which leads to permission bloat.
Some enterprises resort to a reactive approach—monitoring access logs to catch unauthorized usage after the fact. However, this fails to prevent intentional or accidental misuse. To address these issues proactively, data lakes employ three advanced strategies: Tag-Based Access Policies, Deidentification, and Self-Service Access Management.
Traditional access control relies on Access Control Lists (ACLs) set on physical files and folders. For instance, in the Hadoop File System (HDFS), an administrator might use a command like hdfs dfs -setfacl to grant the "human_resources" group read access to a specific file like /salaries.csv.
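For concreteness, the ACL change described above looks roughly like the following; it is shown here as a Python wrapper around the HDFS command line so all examples stay in one language, and it assumes an hdfs client on the PATH and ACLs enabled on the cluster.

    import subprocess

    # Grant the human_resources group read access to /salaries.csv via an HDFS ACL,
    # then print the resulting ACL entries to verify the change.
    subprocess.run(
        ["hdfs", "dfs", "-setfacl", "-m", "group:human_resources:r--", "/salaries.csv"],
        check=True,
    )
    subprocess.run(["hdfs", "dfs", "-getfacl", "/salaries.csv"], check=True)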
In a data lake with millions of files, setting permissions on individual files is impossible. Administrators typically resort to creating folder hierarchies (e.g., an hr_sensitive folder) and granting permissions at the folder level. However, this approach has significant limitations:
Refining access further requires creating new subfolders (e.g., human_resources/engineering), which means moving data into them and breaking existing applications that reference the old paths.

To handle new data safely, companies often create a "Quarantine Zone." New data enters this restricted zone and remains there until a human reviews it and assigns permissions. While logical, this manual review process fails at the scale of big data. If every new file created by a data science experiment required manual review before use, work would grind to a halt.
Modern systems like Apache Ranger (part of the Hortonworks distribution) and Cloudera Navigator utilize tag-based policies. Instead of assigning permissions to specific file paths, administrators assign permissions to tags (e.g., "PII" or "Sensitive"). These policies apply to any file carrying that tag, regardless of its physical location.
This solves the organizational complexity problem. To refine access, administrators simply add more specific tags (e.g., "Engineering") to the files. A policy can then be defined to allow access only if a user has both "HR" and "Engineering" credentials, without moving any data.
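A minimal sketch of how such a tag-based decision might be evaluated (illustrative only, not Apache Ranger's actual policy model or API); the tag and credential names follow the example above.

    # A policy grants access when the file carries the listed tags and the user
    # holds all of the required credentials, regardless of where the file lives.
    POLICIES = [
        {"tags": {"Sensitive", "Engineering"}, "credentials": {"HR", "Engineering"}},
    ]

    def can_access(file_tags: set, user_credentials: set) -> bool:
        """Allow access if any policy covering the file's tags is satisfied."""
        for policy in POLICIES:
            if policy["tags"] <= file_tags and policy["credentials"] <= user_credentials:
                return True
        return False

    # The physical path never enters the decision, only the tags.
    print(can_access({"Sensitive", "Engineering"}, {"HR"}))                 # False
    print(can_access({"Sensitive", "Engineering"}, {"HR", "Engineering"}))  # True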
Tags also modernize the quarantine process. Instead of a physical quarantine folder, the ingestion system automatically tags new files as "Quarantine." A standing policy restricts access to the "Quarantine" tag to data stewards only. Once a steward reviews the file and tags it appropriately (e.g., adding "Sensitive"), they remove the "Quarantine" tag, automatically making the file available to authorized users.
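Continuing the same sketch, the tag-based quarantine flow could look like the following; the catalog structure and function names are invented for illustration.

    def ingest(catalog: dict, path: str) -> None:
        # Frictionless ingestion: every new file starts out tagged Quarantine,
        # and a standing policy restricts that tag to data stewards.
        catalog[path] = {"Quarantine"}

    def steward_review(catalog: dict, path: str, tags_to_add: set) -> None:
        # Once a steward classifies the file, the Quarantine tag is removed and
        # the normal tag-based policies apply automatically.
        catalog[path] |= tags_to_add
        catalog[path].discard("Quarantine")

    catalog = {}
    ingest(catalog, "/landing/new_hires.csv")
    print(catalog["/landing/new_hires.csv"])   # {'Quarantine'}
    steward_review(catalog, "/landing/new_hires.csv", {"Sensitive", "HR"})
    print(catalog["/landing/new_hires.csv"])   # e.g. {'Sensitive', 'HR'}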
However, "frictionless ingestion" creates a risk: data enters the lake so quickly that manual review is impossible. Analysts cannot manually scan million-row files to see if a "Notes" field contains a Social Security Number.
Automated Sensitive Data Detection
The only viable solution is automation. Tools from vendors like Informatica, Dataguise, and Waterline Data automatically scan newly ingested files. They detect sensitive patterns (like credit card numbers) and automatically apply the corresponding tags. These tags are then exported to the security tools (like Apache Ranger) to enforce the correct policies immediately, removing the human bottleneck.
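A toy version of such a scanner, using simple regular expressions; the patterns and tag names are illustrative, and real products use far more sophisticated detection.

    import re

    # Map illustrative regex patterns to the tags they should trigger.
    PATTERNS = {
        "PII": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # US Social Security Number
        "Credit Card": re.compile(r"\b(?:\d[ -]?){15,16}\b"),  # crude card-number pattern
    }

    def detect_tags(rows: list) -> set:
        """Scan every field of every row and return the set of tags to apply."""
        tags = set()
        for row in rows:
            for value in row:
                for tag, pattern in PATTERNS.items():
                    if pattern.search(str(value)):
                        tags.add(tag)
        return tags

    rows = [("Jane Doe", "notes: SSN 123-45-6789"),
            ("John Roe", "paid with 4111 1111 1111 1111")]
    print(detect_tags(rows))  # {'PII', 'Credit Card'}, exported to the policy engine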
Identifying sensitive data allows administrators to restrict access to it. However, restricting access prevents analysts from using that data for legitimate models. To solve this, enterprises use encryption and deidentification.
Deidentification is the preferred method for analytics. It replaces sensitive values with randomly generated, statistically valid substitutes.
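A minimal sketch of the idea for SSN-formatted values; production tools go further, preserving statistical distributions and applying substitutions consistently across data sets, but the function below is only an illustration.

    import random

    def deidentify_ssn(ssn: str, mapping: dict) -> str:
        """Replace a real SSN with a random, format-preserving substitute.

        Reusing the mapping keeps the substitution consistent, so joins and
        aggregations on the column still behave sensibly."""
        if ssn not in mapping:
            mapping[ssn] = "%03d-%02d-%04d" % (
                random.randint(1, 899), random.randint(1, 99), random.randint(1, 9999)
            )
        return mapping[ssn]

    mapping = {}
    print(deidentify_ssn("123-45-6789", mapping))  # e.g. 417-62-0381
    print(deidentify_ssn("123-45-6789", mapping))  # same substitute as above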
Challenges of Deidentification: the substitutes must remain statistically valid so that analytics still produce meaningful results, and sensitive values can hide anywhere, including free-text fields and complex formats, making them difficult to find and replace completely.
In cases where deidentification is too difficult—such as with complex XML Electronic Health Records where sensitive data could be anywhere—enterprises often default to keeping the data encrypted in a secure zone.
Global organizations must comply with data sovereignty laws (e.g., German data cannot leave Germany) and regulations like GDPR. A data catalog helps enforce these rules by tracking provenance (origin) and usage rights.
Tracking Origin: A "Provenance" property can be attached to data sets. If data originates in the US, the property is set to "USA." If that data is combined with data from a German ERP system, the resulting data set's provenance would list both "USA" and "Germany." Policies can then automatically enforce rules, such as "If Provenance contains Germany, block copy to non-EU servers".
Inferred Provenance: Profiling tools can also infer origin by analyzing the content. If a "Country" column contains 10,000 rows listing "Germany," the system can automatically tag the data set as containing German data and apply the necessary sovereignty protections.
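A sketch of how provenance tracking, inference, and enforcement could fit together; the property names, the country list, and the region labels are illustrative assumptions, not APIs from the chapter.

    # Provenance propagates when data sets are combined, and a simple policy
    # blocks copies of German-derived data to servers outside the EU.
    EU = {"Germany", "France", "Ireland"}  # illustrative subset

    def combine(provenance_a: set, provenance_b: set) -> set:
        """A derived data set inherits the provenance of both inputs."""
        return provenance_a | provenance_b

    def infer_provenance(country_column: list) -> set:
        """Inferred provenance: tag the data set with every country it mentions."""
        return set(country_column)

    def copy_allowed(provenance: set, target_region: str) -> bool:
        return not ("Germany" in provenance and target_region not in EU)

    joined = combine({"USA"}, infer_provenance(["Germany"] * 10000))
    print(joined)                           # {'USA', 'Germany'}
    print(copy_allowed(joined, "US-East"))  # False: German data stays in the EU
    print(copy_allowed(joined, "Ireland"))  # True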
The most agile approach to governance involves Self-Service Access Management. This model moves the access decision to the moment the data is actually needed, rather than trying to predict needs in advance.
In this model, analysts do not have access to the physical data initially. Instead, they have access to a metadata catalog. This catalog contains descriptions, tags, and samples of the data sets.
The workflow proceeds as follows: (1) the analyst browses the catalog and identifies a promising data set; (2) the analyst submits an access request, stating the business justification and the time period for which access is needed; (3) the data owner or steward reviews and approves (or rejects) the request; and (4) upon approval, the system provisions access to the physical data, or to a copy of it, for the duration of the project.
This approach creates a full audit trail for compliance (who accessed what and why). It also prevents "permission bloat" because access can be set to expire automatically when the project ends.
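A minimal sketch of that request-and-approve flow with an audit trail and automatic expiry; the class, field, and function names are invented for illustration.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class AccessRequest:
        analyst: str
        data_set: str
        justification: str
        expires: date
        approved: bool = False

    audit_log = []

    def request_access(analyst, data_set, justification, expires):
        req = AccessRequest(analyst, data_set, justification, expires)
        audit_log.append(("requested", req))   # records who asked for what, and why
        return req

    def approve(req: AccessRequest, steward: str) -> None:
        req.approved = True
        audit_log.append(("approved by " + steward, req))

    def has_access(req: AccessRequest, today: date) -> bool:
        # Access lapses automatically at project end, preventing permission bloat.
        return req.approved and today <= req.expires

    req = request_access("alice", "DW.Customers", "churn model", date(2018, 8, 5))
    approve(req, "bob")
    print(has_access(req, date(2018, 7, 1)))   # True
    print(has_access(req, date(2018, 9, 1)))   # False: the grant has expired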
The text details a specific provisioning logic for shared data copies to minimize storage and processing: when an analyst requests a table (e.g., DW.Customers) for a project running from June 1 to August 5, the system provisions a snapshot into a folder named for the extraction date (e.g., /20180601), and other projects requesting the same data over an overlapping period share that snapshot rather than triggering a new extract.
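A rough sketch of that shared-snapshot logic, under the assumption that copies live in date-named folders; the function and variable names are illustrative and the copy step is stubbed out.

    from datetime import date

    snapshots = {}  # (source, dated folder) -> set of projects sharing the copy

    def provision(source: str, project: str, start: date) -> str:
        """Reuse an existing snapshot of the source if one exists; otherwise
        create a dated copy (e.g., /20180601) and register the project on it."""
        for (src, folder), projects in snapshots.items():
            if src == source:
                projects.add(project)
                return folder                     # share the existing copy
        folder = "/" + start.strftime("%Y%m%d")   # e.g., /20180601
        # ...extract the source table into the folder here (stubbed)...
        snapshots[(source, folder)] = {project}
        return folder

    print(provision("DW.Customers", "churn-model", date(2018, 6, 1)))    # /20180601 (new copy)
    print(provision("DW.Customers", "segmentation", date(2018, 6, 15)))  # /20180601 (shared)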
Effective data lake governance relies on moving away from manual, file-level permissions. By combining automation (for sensitive data detection), tag-based policies (for managing complex rules), and self-service provisioning (for on-demand access), organizations can secure their data without stifling the agility required for modern analytics.