Data lakehouses

A data lakehouse is a type of data product that serves as both a data lake and a data warehouse, with the idea that data lakehouses bring "the best of both worlds" together.

Typical data lake and data warehouse architecture treats each system independently, which increases operational overhead and inflates data volume by requiring multiple copies of the same data. Data lakehouse architecture attempts to reduce that overhead by combining both architectures into a single system that can, when appropriate, serve the same data in multiple contexts.

Properties of an effective data lakehouse

An effective data lakehouse...

  • Handles all types of data, including structured, semi-structured, and unstructured
  • Provides the ability to explore, preview, and query data through both visualizations and code (e.g. SQL)
  • Reduces the need for ETL pipelines by allowing the same data to be queried in multiple contexts
  • Manages metadata for searching, tracking, and sharing data
  • Stores data in open formats such as Parquet, JSON, and CSV
  • Decouples storage from compute so that each can scale independently
  • Includes integrated security and governance controls that allow data to be properly classified and protected
  • Serves all data functions within the organization
  • Is cost effective compared to separate data lake and data warehouse solutions

Deeper Knowledge on Data Lakehouses

Azure Synapse Analytics

An integrated set of data services on Microsoft Azure

Broader Topics Related to Data Lakehouses

Data Lakes

Centralized repositories of raw data from a wide range of sources

Data Products

Ways of making data available

Data Warehouses

Centralized repositories of integrated, structured data used for reporting and analysis

Data Lakehouses Knowledge Graph