Data lakehouses
A data lakehouse is a type of data product that serves as both a data lake and a data warehouse, with the idea that a lakehouse brings "the best of both worlds" together.
Typical data lake and data warehouse architecture treats each system independently, which increases operational overhead and inflates data volume by requiring multiple copies of the same data. Data lakehouse architecture attempts to reduce that overhead by combining both architectures into a single system that can, when appropriate, serve the same data in multiple contexts.
Properties of an effective data lakehouse
An effective data lakehouse...
- Handles all types of data, including structured, semi-structured, and unstructured
- Provides the ability to explore, preview, and query data through both visualizations and code (e.g. SQL)
- Reduces the need for ETL pipelines by allowing the same data to be queried in multiple contexts (see the sketch after this list)
- Manages metadata for searching, tracking, and sharing data
- Stores data in open formats such as Parquet, JSON, and CSV
- Decouples storage from compute so that each can scale independently
- Includes integrated security and governance controls that allow data to be properly classified and protected
- Serves all data functions within the organization
- Is cost effective compared to separate data lake and data warehouse solutions
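As a minimal sketch of the "one copy, many contexts" idea, the snippet below writes a single Parquet file and then works with it both as a DataFrame and through warehouse-style SQL. It assumes the pandas, pyarrow, and duckdb packages are installed; the file name events.parquet and the column names are illustrative only, not part of any specific lakehouse product.

```python
import pandas as pd
import duckdb

# Write one copy of the data to an open format (Parquet).
events = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "event": ["view", "view", "purchase", "view"],
    "amount": [0.0, 0.0, 29.99, 0.0],
})
events.to_parquet("events.parquet")  # requires pyarrow (or fastparquet)

# Context 1: exploratory analysis directly on the DataFrame.
print(events.groupby("event")["amount"].sum())

# Context 2: SQL over the same Parquet file, with no copy or ETL step.
print(duckdb.sql("""
    SELECT event, COUNT(*) AS n, SUM(amount) AS revenue
    FROM 'events.parquet'
    GROUP BY event
""").df())
```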
Deeper Knowledge on Data Lakehouses
Azure Synapse Analytics
An integrated set of data services on Microsoft Azure
Broader Topics Related to Data Lakehouses
Data Analysis
The transformation of data to information
Data Lakes
Centralized repositories of raw data from a wide range of sources
Data Science
The scientific method applied to data analysis
Data Products
Ways of making data available
Data Warehouses
Centralized repositories of integrated, structured data optimized for reporting and analysis