Big data

Big data is a blanket term for the data engineering and data analysis challenges that arise from working with extremely large datasets over distributed systems.

In other words, if you ask an analytics team for a report and the report takes too long to run, slows down a production system, or requires too many steps to complete, or if the data is too difficult to identify and access, then it's probably a "big data" problem.

Challenges with big data: Variety, velocity, veracity, volume

Often referred to as the four 'V's, the challenges of big data center on variety, velocity, veracity, and volume.

Variety of data refers to the formats and sources of raw data. Data can come in "typical" tabular forms, like databases, spreadsheets, and CSV files, but also as audio, video, and unstructured text. Beyond databases, data can also originate from APIs, IoT devices, and biomedical devices. Each format and source of data requires additional effort to capture and process.
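As a rough illustration of that extra effort, the sketch below reads the same kind of measurement from two formats (CSV and newline-delimited JSON) and coerces both into one record shape. The field names `device_id` and `reading`, the sample values, and the helper functions are all assumptions made up for this example, not part of any particular system.

```python
import csv
import io
import json


def records_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def records_from_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def normalize(record):
    """Coerce a raw record into the shared (device_id, reading) shape."""
    return {
        "device_id": str(record["device_id"]),
        "reading": float(record["reading"]),
    }


# Hypothetical sample inputs: CSV values arrive as strings, JSON values as numbers.
csv_text = "device_id,reading\na1,20.5\na2,19.0\n"
json_text = '{"device_id": "b7", "reading": 21}\n'

raw = records_from_csv(csv_text) + records_from_json_lines(json_text)
unified = [normalize(r) for r in raw]
print(unified)
```

Each new format needs its own parser, but a single `normalize` step keeps everything downstream working against one schema.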

Velocity of data refers to the rate at which data is generated and processed. Data may be generated and processed in real time, in batches, or sporadically. As the velocity of data increases, so does the complexity and effort required to process it reliably.
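One common middle ground between real-time and batch processing is to group a continuous stream into small batches. The following is a minimal sketch of that idea using only the standard library; the `micro_batches` helper and the integer "events" are hypothetical stand-ins for a real event source.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stream of seven events, processed three at a time.
batches = list(micro_batches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

A smaller batch size lowers latency but increases per-batch overhead, which is one place the added complexity of high-velocity data shows up.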

Veracity of data refers to the trustworthiness, accuracy, and consistency of data. Sources of data may be biased in various ways, contain errors or conflicts, or be full of abnormalities. This translates to a lot of potential work to validate, explain, and compensate for issues in the data or with data sources.
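To make the validation work concrete, here is a small sketch that flags three kinds of problems in a list of readings: missing values, values outside a plausible range, and rows that conflict with an earlier row for the same sensor and timestamp. The column names, the temperature bounds, and the sample rows are illustrative assumptions only.

```python
def find_issues(rows, low=-50.0, high=60.0):
    """Return (row_index, problem) pairs for suspicious readings."""
    issues = []
    seen = {}  # (sensor, timestamp) -> first value observed
    for i, row in enumerate(rows):
        value = row["celsius"]
        if value is None:
            issues.append((i, "missing value"))
            continue
        if not low <= value <= high:
            issues.append((i, "out of range"))
        key = (row["sensor"], row["timestamp"])
        if key in seen and seen[key] != value:
            issues.append((i, "conflicts with earlier row"))
        seen.setdefault(key, value)
    return issues


# Hypothetical sample data containing one of each problem.
rows = [
    {"sensor": "s1", "timestamp": "t1", "celsius": 21.5},
    {"sensor": "s1", "timestamp": "t1", "celsius": 22.0},
    {"sensor": "s2", "timestamp": "t1", "celsius": None},
    {"sensor": "s2", "timestamp": "t2", "celsius": 400.0},
]

issues = find_issues(rows)
print(issues)
```

Checks like these only detect problems; deciding whether to drop, correct, or keep a flagged row still depends on understanding the data source.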

Volume of data refers to the sheer size of the data to be processed. It's common for enterprises to require petabytes of storage to house the data that informs their business decisions. Storage at this scale requires significant architecture to ensure availability and accessibility of data.
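A back-of-the-envelope calculation shows how quickly storage reaches this scale. The event rate, event size, and replication factor below are made-up figures chosen for illustration, and a petabyte is taken as 10^15 bytes.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365
BYTES_PER_PETABYTE = 10**15  # decimal petabyte


def yearly_petabytes(events_per_second, bytes_per_event, replication_factor=1):
    """Estimate the storage consumed in one year, in petabytes."""
    raw_bytes = events_per_second * bytes_per_event * SECONDS_PER_DAY * DAYS_PER_YEAR
    return raw_bytes * replication_factor / BYTES_PER_PETABYTE


# Hypothetical workload: 50,000 events per second at 1,000 bytes each.
single_copy = yearly_petabytes(50_000, 1_000)
three_copies = yearly_petabytes(50_000, 1_000, replication_factor=3)
print(f"{single_copy:.2f} PB per year, {three_copies:.2f} PB with 3 replicas")
```

Under these assumptions a single copy of the data takes roughly 1.6 PB per year, and keeping three replicas for availability brings that to about 4.7 PB.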

Big Data Resources

Broader Topics Related to Big Data

Distributed Systems

Software systems distributed and coordinated over a network

Business Intelligence

Methods to bridge the gap between data and business

Data Engineering

Engineering approaches to data management
