This chapter explores the transition from traditional data warehousing to modern data lake architectures. It details how data lakes can replicate the essential functions of data warehouses while overcoming their limitations regarding scale, cost, and agility. The chapter further explains how to expand a "data pond" (a data warehouse built on big data technology) into a full "data lake" by incorporating raw, external, and streaming data, and how to manage transformations and downstream consumption.
Essential Functions of a Data Warehouse
To understand the utility of a data lake, one must first understand the functions traditionally performed by data warehouses. Originally envisioned as repositories for all historical enterprise data, data warehouses evolved into highly managed systems containing production-quality data for critical analytics. However, due to high costs and performance-tuning requirements, they became rigid; changes required lengthy architectural reviews, preventing ad hoc querying and agility.
Despite these limitations, data warehouses perform four critical functions that data lakes must also address to be successful:
Dimensional Modeling and Schema Design
Operational systems utilize normalized data models to minimize redundancy and optimize for small, fast transactions. In contrast, data warehouses utilize denormalized models (such as star schemas) in which tables contain as many attributes as possible so that analytical applications can process information in a single pass. A star schema typically consists of dimension tables (representing entities like customers or products) and fact tables (representing activities like orders).
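To make the distinction concrete, here is a minimal pandas sketch of a star schema; the tables, columns, and values are invented purely for illustration.

```python
import pandas as pd

# Dimension tables: one row per entity, carrying descriptive attributes.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Alice", "Bob"],
    "segment": ["Consumer", "Business"],
})
products = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})

# Fact table: one row per activity (an order), keyed to the dimensions.
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "customer_id": [1, 2, 1],
    "product_id": [10, 20, 20],
    "amount": [25.0, 40.0, 40.0],
})

# Denormalized view: a single pass of joins gives analysts every attribute
# they need without navigating a normalized operational model.
star = (orders
        .merge(customers, on="customer_id")
        .merge(products, on="product_id"))
print(star.groupby(["segment", "category"])["amount"].sum())
```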
Because data lakes use massively scalable and cost-effective technology (like Hadoop), they do not strictly require dimensional modeling or aggregation for performance reasons alone. However, the logical challenges of mapping diverse source data into a common format for analysis remain.
Preserving History and Slowly Changing Dimensions
A critical limitation of traditional data warehouses is the high cost of storage and processing, which forces enterprises to aggregate or archive older data. For example, a warehouse might keep detailed transactions for only three years, summarizing older data into daily or monthly totals. This aggregation results in a loss of detail, complicates query writing, and limits historical trend analysis.
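The kind of summarization involved, and the detail it throws away, can be seen in a small pandas sketch; the transactions here are invented.

```python
import pandas as pd

# Detailed transactions (what a data lake can afford to keep indefinitely).
tx = pd.DataFrame({
    "ts": pd.to_datetime(["2016-03-02 09:15", "2016-03-02 17:40", "2016-03-03 11:05"]),
    "store": ["A", "B", "A"],
    "amount": [19.99, 5.50, 7.25],
})

# Daily summary (what a space-constrained warehouse might keep instead).
# Time of day and individual transaction amounts are gone after this step.
daily = tx.set_index("ts").groupby("store").resample("D")["amount"].sum()
print(daily)
```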
Big data platforms allow enterprises to store historical data at the finest granularity indefinitely. This enables richer analysis, such as the case of Visa, which moved from analyzing 2% of transactions to 100% of transactions using Hadoop, drastically improving fraud detection.
In traditional warehouses, Slowly Changing Dimensions (SCD) are used to track changes in entity states (e.g., a customer changing marital status) so that transactions are analyzed against the correct historical state of the customer. Implementing SCDs in a data lake can be complex because it requires joining temporal data during analysis.
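As an illustration of why this is complex, the following pandas sketch performs the kind of temporal join an SCD requires, matching each transaction to the dimension row whose validity window covers it; the data and column names are invented.

```python
import pandas as pd

# Slowly changing dimension: each row is valid only for a date range.
customer_dim = pd.DataFrame({
    "customer_id": [1, 1],
    "marital_status": ["single", "married"],
    "valid_from": pd.to_datetime(["2015-01-01", "2016-06-01"]),
    "valid_to": pd.to_datetime(["2016-05-31", "2099-12-31"]),
})

transactions = pd.DataFrame({
    "customer_id": [1, 1],
    "tx_date": pd.to_datetime(["2016-03-02", "2016-07-15"]),
    "amount": [100.0, 250.0],
})

# Temporal join: pick the dimension row whose validity window covers the
# transaction date, so each transaction sees the customer as they were then.
joined = transactions.merge(customer_dim, on="customer_id")
joined = joined[(joined.tx_date >= joined.valid_from) &
                (joined.tx_date <= joined.valid_to)]
print(joined[["tx_date", "amount", "marital_status"]])
```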
Keeping History in a Data Pond: Partitioning and Snapshots
Data ponds typically organize data into files or partitions. A common structure involves creating a separate folder for each ingestion batch, often organized hierarchically by year, month, and day. Hive, a SQL interface for Hadoop, utilizes "partitioned tables" to optimize queries by reading only the specific folders relevant to a query (e.g., Year=2016 and Month=3).
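A rough PySpark sketch of this layout and of partition pruning follows; the paths and schema are hypothetical, and the same pattern could equally be expressed with a Hive partitioned table.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Ingest one batch and write it under Year=/Month=/Day= folders, the same
# hierarchical layout Hive uses for partitioned tables.
orders = spark.read.json("/landing/orders/2016-03-02.json")  # hypothetical landing path
(orders
 .withColumn("Year", F.lit(2016))
 .withColumn("Month", F.lit(3))
 .withColumn("Day", F.lit(2))
 .write.mode("append")
 .partitionBy("Year", "Month", "Day")
 .parquet("/data/orders"))

# A query that filters on the partition columns reads only the matching
# folders (partition pruning) instead of scanning the whole table.
march = spark.read.parquet("/data/orders").where("Year = 2016 AND Month = 3")
march.show()
```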
To handle dimensional changes without the complexity of traditional SCDs, data ponds often use snapshots. Instead of tracking individual field changes, the system ingests the latest version of the master data every day into its own dated folder (e.g., /customers/Year=2016/Month=3/Day=2), so any transaction can be analyzed against the snapshot from the same day.
Growing Data Ponds into Data Lakes
A data pond becomes a data lake when it expands beyond the scope of a data warehouse to include data that was previously discarded or ignored due to lack of immediate business value or cost constraints.
1. Storing Raw Data
Traditional warehousing involves converting data to a clean, normalized format, which inevitably leads to information loss. A data lake keeps the original, untransformed data alongside any processed versions, so attributes dropped or altered during cleansing remain available for future projects.
2. Managing External Data
Enterprises purchase significant amounts of external data (credit reports, Nielsen ratings, etc.). Managing this data presents challenges regarding quality, licensing costs (duplicate purchases by different teams), and intellectual property rights.
Storing external data in a standard, well-defined location (e.g., /Data/External/<vendor>/<dataset>) or registering it in a centralized catalog allows the organization to track what has been purchased. This prevents different teams from buying the same data twice and allows for consistent data quality management and lineage tracking.
3. IoT and Streaming Data
Machine-generated data (IoT) is rapidly outpacing human-generated data. This data is often streamed for real-time monitoring. To interpret real-time behavior (e.g., anomaly detection), it must be compared against historical baselines stored in the data lake.
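As a simple illustration of comparing a stream against a stored baseline, the following Python sketch flags readings that deviate sharply from a sensor's history; the data, threshold, and function names are invented.

```python
import pandas as pd

# Historical sensor readings kept in the data lake (illustrative values).
history = pd.DataFrame({
    "sensor_id": ["s1"] * 6,
    "temperature": [70.1, 69.8, 70.4, 70.0, 69.9, 70.2],
})

# Per-sensor baseline derived from the full history.
baseline = history.groupby("sensor_id")["temperature"].agg(["mean", "std"])

def is_anomaly(sensor_id: str, value: float, threshold: float = 3.0) -> bool:
    """Flag a real-time reading that deviates sharply from its historical baseline."""
    stats = baseline.loc[sensor_id]
    return abs(value - stats["mean"]) > threshold * stats["std"]

# Readings arriving on the stream are checked against the stored baseline.
for reading in [70.3, 75.0]:
    print(reading, "anomaly" if is_anomaly("s1", reading) else "normal")
```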
Real-Time Data Lakes and the Lambda Architecture
Integrating real-time data streams with historical data analysis requires specific architectural patterns. The Lambda Architecture addresses this by maintaining a batch layer that periodically recomputes views over the complete historical data and a speed layer that processes recent events with low latency; queries merge the results of both.
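The essence of the pattern can be sketched in a few lines of Python; the views and counts here are placeholders, and a real system would use a batch engine and a stream processor rather than in-memory dictionaries.

```python
# Minimal sketch of the Lambda pattern: a batch view computed over all
# historical data in the lake, a speed view maintained from the live stream,
# and queries that merge the two.

batch_view = {"page_a": 10_000, "page_b": 4_200}   # recomputed periodically from the lake
speed_view = {"page_a": 12, "page_c": 3}           # updated as stream events arrive

def record_event(page: str) -> None:
    """Speed layer: fold a new stream event into the real-time view."""
    speed_view[page] = speed_view.get(page, 0) + 1

def query(page: str) -> int:
    """Serving layer: merge the complete-but-stale batch view with the recent speed view."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

record_event("page_a")
print(query("page_a"))  # historical count plus the events seen since the last batch run
```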
Data Transformations in the Lake
While data lakes allow for schema-on-read, transforming data is still necessary for harmonization, entity resolution, and performance optimization.
Harmonization and Entity Resolution
In a warehouse, all data must be harmonized (mapped to a common schema) and resolved (linking different instances of the same entity) before loading. This is expensive and forces the discard of non-conforming attributes. In a data lake, these processes can be performed on demand. Data is ingested raw. Entity resolution is applied only to the specific entities and attributes needed for a current project, rather than attempting to resolve every conflict in the enterprise at once. This allows for flexible resolution strategies; for example, a marketing project might resolve entities based on whether a person ever lived in a city, whereas a billing project needs the current address.
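A small pandas sketch of this project-specific resolution follows; the sources, matching rule, and records are invented, and another project would simply choose a different rule.

```python
import pandas as pd

# Two raw customer sources ingested into the lake without upfront resolution.
crm = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray"],
    "email": ["Ann.Lee@example.com", "bob@example.com"],
    "city": ["Boston", "Denver"],
})
web = pd.DataFrame({
    "name": ["A. Lee", "Bob Ray"],
    "email": ["ann.lee@example.com", "bob@example.com"],
    "city": ["Cambridge", "Denver"],
})

# Resolve entities only as far as this project needs: here, a normalized
# email address is treated as "the same person".
combined = pd.concat([crm, web], ignore_index=True)
combined["key"] = combined["email"].str.lower().str.strip()

# One resolved record per key; this project keeps the first name seen and
# every city the person has ever been associated with.
names = combined.groupby("key")["name"].first()
cities = combined.groupby("key")["city"].agg(set)
print(pd.DataFrame({"name": names, "cities": cities}))
```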
ETL Offloading
A primary use case for data lakes is "ETL offloading." Traditional ETL (Extract, Transform, Load) tools often struggle to scale. High-end database vendors promoted ELT (Extract, Load, Transform) to use the database engine for processing, but this consumes expensive compute resources. Hadoop offers a cost-effective alternative. In this model, raw data is extracted from source systems and loaded into the data lake, the transformations run on inexpensive Hadoop compute, and the cleansed, conformed results are loaded into the data warehouse for consumption.
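A PySpark sketch of such an offloaded pipeline might look like the following; the paths, table names, and JDBC connection details are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload").getOrCreate()

# Extract: land the raw source extracts in the lake untouched (hypothetical path).
raw = spark.read.csv("/data/raw/orders/", header=True, inferSchema=True)

# Transform: do the heavy cleansing and conforming on inexpensive Hadoop
# compute instead of inside the warehouse engine.
clean = (raw
         .dropDuplicates(["order_id"])
         .withColumn("order_date", F.to_date("order_date"))
         .filter(F.col("amount") > 0))

# Load: hand the conformed result to the warehouse (here via JDBC; the
# connection details are placeholders).
(clean.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://warehouse:5432/edw")
 .option("dbtable", "staging.orders")
 .option("user", "etl_user")
 .option("password", "***")
 .mode("append")
 .save())
```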
Target Systems
Data in the lake is consumed by various downstream systems, including the data warehouse itself (fed via ETL offloading) as well as advanced analytics and self-service tools.
Conclusion
The data lake offers a powerful evolution from the data pond/warehouse model. By serving as a central repository for raw, external, and processed data, and by leveraging low-cost storage to preserve full history, it enables advanced analytics and self-service while continuing to support traditional warehousing needs through ETL offloading.