This chapter explores the transition from traditional data warehousing to modern data lake architectures. It details how data lakes can replicate the essential functions of data warehouses while overcoming their limitations regarding scale, cost, and agility. The chapter further explains how to expand a "data pond" (a data warehouse built on big data technology) into a full "data lake" by incorporating raw, external, and streaming data, and how to manage transformations and downstream consumption.
Essential Functions of a Data Warehouse
To understand the utility of a data lake, one must first understand the functions traditionally performed by data warehouses. Originally envisioned as repositories for all historical enterprise data, data warehouses evolved into highly managed systems containing production-quality data for critical analytics. However, due to high costs and performance-tuning requirements, they became rigid; changes required lengthy architectural reviews, preventing ad hoc querying and agility.
Despite these limitations, data warehouses perform four critical functions that data lakes must also address to be successful:
Dimensional Modeling and Schema Design
Operational systems utilize normalized data models to minimize redundancy and optimize for small, fast transactions. In contrast, data warehouses utilize denormalized models (such as star schemas) in which tables contain as many attributes as possible so that analytical applications can process information in a single pass. A star schema typically consists of dimension tables (representing entities like customers or products) and fact tables (representing activities like orders).
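To make the distinction concrete, here is a minimal pandas sketch of a star schema; the tables, columns, and values are invented purely for illustration.

```python
import pandas as pd

# Dimension tables: one row per entity, carrying descriptive attributes.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Alice", "Bob"],
    "segment": ["Consumer", "Business"],
})
products = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})

# Fact table: one row per activity (an order), keyed to the dimensions.
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "customer_id": [1, 2, 1],
    "product_id": [10, 20, 20],
    "amount": [25.0, 40.0, 40.0],
})

# Denormalized view: a single pass of joins gives analysts every attribute
# they need without navigating a normalized operational model.
star = (orders
        .merge(customers, on="customer_id")
        .merge(products, on="product_id"))
print(star.groupby(["segment", "category"])["amount"].sum())
```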
Because data lakes use massively scalable and cost-effective technology (like Hadoop), they do not strictly require dimensional modeling or aggregation for performance reasons alone. However, the logical challenges of mapping diverse source data into a common format for analysis remain.
Preserving History and Slowly Changing Dimensions
A critical limitation of traditional data warehouses is the high cost of storage and processing, which forces enterprises to aggregate or archive older data. For example, a warehouse might keep detailed transactions for only three years, summarizing older data into daily or monthly totals. This aggregation results in a loss of detail, complicates query writing, and limits historical trend analysis.
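The kind of summarization involved, and the detail it throws away, can be seen in a small pandas sketch; the transactions here are invented.

```python
import pandas as pd

# Detailed transactions (what a data lake can afford to keep indefinitely).
tx = pd.DataFrame({
    "ts": pd.to_datetime(["2016-03-02 09:15", "2016-03-02 17:40", "2016-03-03 11:05"]),
    "store": ["A", "B", "A"],
    "amount": [19.99, 5.50, 7.25],
})

# Daily summary (what a space-constrained warehouse might keep instead).
# Time of day and individual transaction amounts are gone after this step.
daily = tx.set_index("ts").groupby("store").resample("D")["amount"].sum()
print(daily)
```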
Big data platforms allow enterprises to store historical data at the finest granularity indefinitely. This enables richer analysis, such as the case of Visa, which moved from analyzing 2% of transactions to 100% of transactions using Hadoop, drastically improving fraud detection.
In traditional warehouses, Slowly Changing Dimensions (SCD) are used to track changes in entity states (e.g., a customer changing marital status) so that transactions are analyzed against the correct historical state of the customer. Implementing SCDs in a data lake can be complex because it requires joining temporal data during analysis.
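As an illustration of why this is complex, the following pandas sketch performs the kind of temporal join an SCD requires, matching each transaction to the dimension row whose validity window covers it; the data and column names are invented.

```python
import pandas as pd

# Slowly changing dimension: each row is valid only for a date range.
customer_dim = pd.DataFrame({
    "customer_id": [1, 1],
    "marital_status": ["single", "married"],
    "valid_from": pd.to_datetime(["2015-01-01", "2016-06-01"]),
    "valid_to": pd.to_datetime(["2016-05-31", "2099-12-31"]),
})

transactions = pd.DataFrame({
    "customer_id": [1, 1],
    "tx_date": pd.to_datetime(["2016-03-02", "2016-07-15"]),
    "amount": [100.0, 250.0],
})

# Temporal join: pick the dimension row whose validity window covers the
# transaction date, so each transaction sees the customer as they were then.
joined = transactions.merge(customer_dim, on="customer_id")
joined = joined[(joined.tx_date >= joined.valid_from) &
                (joined.tx_date <= joined.valid_to)]
print(joined[["tx_date", "amount", "marital_status"]])
```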
Keeping History in a Data Pond: Partitioning and Snapshots
Data ponds typically organize data into files or partitions. A common structure involves creating a separate folder for each ingestion batch, often organized hierarchically by year, month, and day. Hive, a SQL interface for Hadoop, utilizes "partitioned tables" to optimize queries by reading only the specific folders relevant to a query (e.g., Year=2016 and Month=3).
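A rough PySpark sketch of this layout and of partition pruning follows; the paths and schema are hypothetical, and the same pattern could equally be expressed with a Hive partitioned table.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Ingest one batch and write it under Year=/Month=/Day= folders, the same
# hierarchical layout Hive uses for partitioned tables.
orders = spark.read.json("/landing/orders/2016-03-02.json")  # hypothetical landing path
(orders
 .withColumn("Year", F.lit(2016))
 .withColumn("Month", F.lit(3))
 .withColumn("Day", F.lit(2))
 .write.mode("append")
 .partitionBy("Year", "Month", "Day")
 .parquet("/data/orders"))

# A query that filters on the partition columns reads only the matching
# folders (partition pruning) instead of scanning the whole table.
march = spark.read.parquet("/data/orders").where("Year = 2016 AND Month = 3")
march.show()
```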
To handle dimensional changes without the complexity of traditional SCDs, data ponds often use snapshots. Instead of tracking individual field changes, the system ingests the latest version of the master data every day into its own dated folder (e.g., /customers/Year=2016/Month=3/Day=2), so any transaction can be analyzed against the snapshot from the same day.
Growing Data Ponds into Data Lakes
A data pond becomes a data lake when it expands beyond the scope of a data warehouse to include data that was previously discarded or ignored due to lack of immediate business value or cost constraints.
1. Storing Raw Data
Traditional warehousing involves converting data to a clean, normalized format, which inevitably leads to information loss. A data lake keeps the original, untransformed data alongside any processed versions, so attributes dropped or altered during cleansing remain available for future projects.
2. Managing External Data
Enterprises purchase significant amounts of external data (credit reports, Nielsen ratings, etc.). Managing this data presents challenges regarding quality, licensing costs (duplicate purchases by different teams), and intellectual property rights.
Storing external data in a standard, well-defined location (e.g., /Data/External/<vendor>/<dataset>) or registering it in a centralized catalog allows the organization to track what has been purchased. This prevents different teams from buying the same data twice and allows for consistent data quality management and lineage tracking.
3. IoT and Streaming Data
Machine-generated data (IoT) is rapidly outpacing human-generated data. This data is often streamed for real-time monitoring. To interpret real-time behavior (e.g., anomaly detection), it must be compared against historical baselines stored in the data lake.
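As a simple illustration of comparing a stream against a stored baseline, the following Python sketch flags readings that deviate sharply from a sensor's history; the data, threshold, and function names are invented.

```python
import pandas as pd

# Historical sensor readings kept in the data lake (illustrative values).
history = pd.DataFrame({
    "sensor_id": ["s1"] * 6,
    "temperature": [70.1, 69.8, 70.4, 70.0, 69.9, 70.2],
})

# Per-sensor baseline derived from the full history.
baseline = history.groupby("sensor_id")["temperature"].agg(["mean", "std"])

def is_anomaly(sensor_id: str, value: float, threshold: float = 3.0) -> bool:
    """Flag a real-time reading that deviates sharply from its historical baseline."""
    stats = baseline.loc[sensor_id]
    return abs(value - stats["mean"]) > threshold * stats["std"]

# Readings arriving on the stream are checked against the stored baseline.
for reading in [70.3, 75.0]:
    print(reading, "anomaly" if is_anomaly("s1", reading) else "normal")
```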
Real-Time Data Lakes and the Lambda Architecture
Integrating real-time data streams with historical data analysis requires specific architectural patterns. The Lambda Architecture addresses this by maintaining a batch layer that periodically recomputes views over the complete historical data and a speed layer that processes recent events with low latency; queries merge the results of both.
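The essence of the pattern can be sketched in a few lines of Python; the views and counts here are placeholders, and a real system would use a batch engine and a stream processor rather than in-memory dictionaries.

```python
# Minimal sketch of the Lambda pattern: a batch view computed over all
# historical data in the lake, a speed view maintained from the live stream,
# and queries that merge the two.

batch_view = {"page_a": 10_000, "page_b": 4_200}   # recomputed periodically from the lake
speed_view = {"page_a": 12, "page_c": 3}           # updated as stream events arrive

def record_event(page: str) -> None:
    """Speed layer: fold a new stream event into the real-time view."""
    speed_view[page] = speed_view.get(page, 0) + 1

def query(page: str) -> int:
    """Serving layer: merge the complete-but-stale batch view with the recent speed view."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

record_event("page_a")
print(query("page_a"))  # historical count plus the events seen since the last batch run
```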
Data Transformations in the Lake
While data lakes allow for schema-on-read, transforming data is still necessary for harmonization, entity resolution, and performance optimization.
Harmonization and Entity Resolution
In a warehouse, all data must be harmonized (mapped to a common schema) and resolved (linking different instances of the same entity) before loading. This is expensive and forces the discard of non-conforming attributes. In a data lake, these processes can be performed on demand. Data is ingested raw. Entity resolution is applied only to the specific entities and attributes needed for a current project, rather than attempting to resolve every conflict in the enterprise at once. This allows for flexible resolution strategies; for example, a marketing project might resolve entities based on whether a person ever lived in a city, whereas a billing project needs the current address.
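A small pandas sketch of this project-specific resolution follows; the sources, matching rule, and records are invented, and another project would simply choose a different rule.

```python
import pandas as pd

# Two raw customer sources ingested into the lake without upfront resolution.
crm = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray"],
    "email": ["Ann.Lee@example.com", "bob@example.com"],
    "city": ["Boston", "Denver"],
})
web = pd.DataFrame({
    "name": ["A. Lee", "Bob Ray"],
    "email": ["ann.lee@example.com", "bob@example.com"],
    "city": ["Cambridge", "Denver"],
})

# Resolve entities only as far as this project needs: here, a normalized
# email address is treated as "the same person".
combined = pd.concat([crm, web], ignore_index=True)
combined["key"] = combined["email"].str.lower().str.strip()

# One resolved record per key; this project keeps the first name seen and
# every city the person has ever been associated with.
names = combined.groupby("key")["name"].first()
cities = combined.groupby("key")["city"].agg(set)
print(pd.DataFrame({"name": names, "cities": cities}))
```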
ETL Offloading
A primary use case for data lakes is "ETL offloading." Traditional ETL (Extract, Transform, Load) tools often struggle to scale. High-end database vendors promoted ELT (Extract, Load, Transform) to use the database engine for processing, but this consumes expensive compute resources. Hadoop offers a cost-effective alternative. In this model, raw data is extracted from source systems and loaded into the data lake, the transformations run on inexpensive Hadoop compute, and the cleansed, conformed results are loaded into the data warehouse for consumption.
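A PySpark sketch of such an offloaded pipeline might look like the following; the paths, table names, and JDBC connection details are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload").getOrCreate()

# Extract: land the raw source extracts in the lake untouched (hypothetical path).
raw = spark.read.csv("/data/raw/orders/", header=True, inferSchema=True)

# Transform: do the heavy cleansing and conforming on inexpensive Hadoop
# compute instead of inside the warehouse engine.
clean = (raw
         .dropDuplicates(["order_id"])
         .withColumn("order_date", F.to_date("order_date"))
         .filter(F.col("amount") > 0))

# Load: hand the conformed result to the warehouse (here via JDBC; the
# connection details are placeholders).
(clean.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://warehouse:5432/edw")
 .option("dbtable", "staging.orders")
 .option("user", "etl_user")
 .option("password", "***")
 .mode("append")
 .save())
```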
Target Systems
Data in the lake is consumed by various downstream systems, including the data warehouse itself (fed via ETL offloading) as well as advanced analytics and self-service tools.
Conclusion
The data lake offers a powerful evolution from the data pond/warehouse model. By serving as a central repository for raw, external, and processed data, and by leveraging low-cost storage to preserve full history, it enables advanced analytics and self-service while continuing to support traditional warehousing needs through ETL offloading.