Data lakes

A data lake is a type of data product that serves as a centralized repository of raw data drawn from a wide range of sources and stored so that it can be conveniently retrieved for analysis.

Data lakes generally store data "as is" - whether structured, semi-structured, or unstructured - with little or no transformation during ingestion. This makes data lakes relatively easy and inexpensive to create, especially compared with the highly formalized structures of data warehouses.
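To illustrate what storing data "as is" can look like, here is a minimal sketch in Python that uses a local directory as a stand-in for a data lake. The function name ingest_raw, the raw/<source>/<date>/ folder layout, and the .meta.json sidecar file are all assumptions made for illustration, not a standard or any product's API. The point is that ingestion copies bytes without parsing them or enforcing a schema.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(source_file: Path, lake_root: Path, source_name: str) -> Path:
    """Copy a file into the lake unchanged and record minimal metadata."""
    # Hypothetical layout: <lake_root>/raw/<source_name>/<YYYY-MM-DD>/<filename>
    ingest_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = lake_root / "raw" / source_name / ingest_date
    target_dir.mkdir(parents=True, exist_ok=True)

    # The bytes are copied as is: no parsing, no schema, no transformation.
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)

    # A small sidecar file (an assumption of this sketch) records where the
    # object came from, so it stays discoverable later.
    metadata = {
        "source": source_name,
        "original_name": source_file.name,
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = target.with_name(target.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return target
```

The same function handles a CSV export, a JSON log, or an audio file, because it never inspects the contents. That is what makes ingestion cheap. It also means that any structure has to be applied later, when the data is read.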

Data lakes can also be especially useful in data science because they can hold data not traditionally kept in databases, such as audio and video files, alongside more "traditional" tabular data. Such files may contain important data for training machine learning models.

Data swamps: Challenges with data lakes

Poorly maintained data lakes are often referred to as data swamps. When data in a data lake is not subject to effective governance or gains a reputation for being unreliable, the usefulness of a data lake can quickly decline.

Additionally, access controls can be difficult to establish in data lakes because so much varied data sits in one central store. Personally identifiable information, intellectual property, and other sensitive data can easily leak if access is not carefully controlled. The need to control access at a granular level can negate some of the apparent simplicity of a data lake. Together with the governance problems described above, this makes many data lakes a poor fit for business intelligence workloads.
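As a rough illustration of what granular access control involves, the sketch below checks read access by path prefix and denies everything that is not explicitly granted. The roles, prefixes, and the can_read function are hypothetical. Real data lake platforms provide their own access control mechanisms, and this is not the API of any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    role: str
    prefix: str  # path prefix within the lake that the role may read


# Hypothetical grants, invented for this example.
GRANTS = [
    Grant("data_scientist", "raw/sensor_logs/"),
    Grant("data_scientist", "raw/call_audio/"),
    Grant("hr_analyst", "raw/hr/"),
]


def can_read(role: str, object_path: str, grants=GRANTS) -> bool:
    """Deny by default; allow only when a grant's prefix covers the path."""
    return any(
        g.role == role and object_path.startswith(g.prefix) for g in grants
    )


print(can_read("data_scientist", "raw/sensor_logs/2024/device-7.json"))
print(can_read("data_scientist", "raw/hr/salaries.csv"))
```

Under these grants, the first call is allowed and the second is denied. Even in this toy form, every new source added to the lake needs a decision about who may read it, and that ongoing effort is what erodes the simplicity of the approach.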

Deeper Knowledge on Data Lakes

Azure Data Lake Storage Gen2

A data lake storage solution built on Azure Blob Storage

Azure Synapse Analytics

An integrated set of data services on Microsoft Azure

Data Lakehouses

A combination of data lakes and data warehouses

Broader Topics Related to Data Lakes

Data Products

Ways of making data available
