A data lake is a type of data product that serves as a centralized repository of raw data drawn from a wide range of sources and stored for convenient retrieval and analysis.
Data lakes generally store data "as is", whether structured, semi-structured, or unstructured, with little or no transformation during ingestion. This makes data lakes relatively easy and inexpensive to create, especially compared with the highly formalized structures of data warehouses.
Data lakes can also be especially useful in data science because they can store data not traditionally kept in databases, such as audio and video files that may be needed to train machine learning models, alongside more "traditional" tabular data.
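As a rough illustration of "as is" storage, the sketch below writes incoming files into a lake directory byte for byte and applies structure only when the data is read for analysis, an approach often called schema-on-read. The folder layout (raw/<source>/<filename>), the function names, and the sample payloads are assumptions made for this example and do not describe any particular product.

```python
import csv
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload unchanged under <lake_root>/raw/<source>/<filename>."""
    # Hypothetical layout: one folder per source system.
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, validation, or schema at ingestion
    return target


def read_orders(path: Path) -> list[dict[str, str]]:
    """Apply structure only at read time, when the data is analyzed."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# A temporary directory stands in for the lake's storage.
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Tabular and binary data land side by side, both untouched.
csv_path = ingest_raw(lake, "shop", "orders.csv", b"order_id,amount\n1,9.50\n2,12.00\n")
audio_path = ingest_raw(lake, "call_center", "call_0001.wav", b"RIFF\x00\x00\x00\x00WAVE")

orders = read_orders(csv_path)
total = sum(float(row["amount"]) for row in orders)
```

Because nothing is validated on the way in, ingestion stays cheap, but every reader has to know how to interpret each file.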
Data swamps: Challenges with data lakes
Poorly maintained data lakes are often referred to as data swamps. When data in a data lake is not subject to effective governance or gains a reputation for being unreliable, the usefulness of a data lake can quickly decline.
Additionally, access controls can be difficult to establish in data lakes because of their centralized nature. Personally identifiable information, intellectual property, and other sensitive data can easily leak if access is not carefully controlled. The need to control access at a granular level can negate some of the apparent simplicity of a data lake. Combined with the reliability problems described above, this means that most data lakes are not suitable for business intelligence workloads.
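One way to picture granular access control is a catalog in which every object carries sensitivity tags and every role is cleared for a set of tags. The sketch below is a minimal illustration under that assumption. The tag names, the roles, the paths, and the rule that untagged objects are denied are all hypothetical and not taken from any specific platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LakeObject:
    path: str
    tags: frozenset[str] = field(default_factory=frozenset)


# Hypothetical policy: the tags each role is cleared to read.
ROLE_CLEARANCE = {
    "analyst": frozenset({"public"}),
    "privacy_officer": frozenset({"public", "pii"}),
}


def can_read(role: str, obj: LakeObject) -> bool:
    """Allow access only if the object is tagged and the role is cleared for every tag."""
    allowed = ROLE_CLEARANCE.get(role, frozenset())  # unknown roles get no clearance
    # Untagged objects are treated as ungoverned and denied by default.
    return bool(obj.tags) and obj.tags <= allowed


catalog = [
    LakeObject("raw/shop/orders.csv", frozenset({"public"})),
    LakeObject("raw/crm/customers.json", frozenset({"public", "pii"})),
    LakeObject("raw/misc/dump.bin"),  # never classified
]

readable_by_analyst = [obj.path for obj in catalog if can_read("analyst", obj)]
```

Even in this toy form, every object needs to be classified before anyone can use it, which is the overhead that erodes a data lake's apparent simplicity.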
Deeper Knowledge on Data Lakes
Azure Data Lake Gen2
A data lake storage solution built on Azure Blob Storage
Azure Synapse Analytics
An integrated set of data services on Microsoft Azure
A combination of data lakes and data warehouses
Broader Topics Related to Data Lakes
Ways of making data available