Exploratory data analysis (EDA)
Exploratory data analysis (EDA) is an aspect of data analysis and data wrangling that consists of preliminary investigations into an unfamiliar dataset. These investigations aim to detect patterns, identify anomalies, check assumptions, and summarize the data so that it is well understood before it is used in a broader context, such as machine learning input.
EDA often starts simply, by determining the number of columns (characteristics) and rows (observations), whether data is missing or corrupt, and the data type of each column. Basic aggregate functions can be applied to the values to determine the count, mean, standard deviation, minimum, and maximum for each characteristic. Data may also be broken down into quartiles or other bins to show how its values are spread.
Note: In Pandas, the data_frame.describe() function can make quick work of the first few steps of exploratory data analysis.
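As a rough illustration of these first steps, the sketch below runs them on a small, made-up dataset. The column names and values are hypothetical and exist only for this example; data_frame stands in for whatever DataFrame is being explored.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are invented for illustration.
data_frame = pd.DataFrame(
    {
        "age": [34, 45, None, 23, 51],
        "income": [52000.0, 64000.0, 58000.0, None, 71000.0],
        "city": ["Austin", "Boston", "Austin", "Denver", None],
    }
)

# Number of rows (observations) and columns (characteristics).
n_rows, n_columns = data_frame.shape

# Data type of each column.
column_types = data_frame.dtypes

# Count of missing values in each column.
missing_per_column = data_frame.isna().sum()

# Count, mean, standard deviation, min, quartiles, and max
# for each numeric column.
summary = data_frame.describe()

# Quartiles of a single column, computed explicitly.
quartiles = data_frame["income"].quantile([0.25, 0.5, 0.75])

print(f"{n_rows} rows x {n_columns} columns")
print(column_types)
print(missing_per_column)
print(summary)
print(quartiles)
```

By default, describe() summarizes only the numeric columns, so the text column "city" does not appear in its output here.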
Visualizations can help with additional analysis. Common visualizations include:
- A correlation matrix shows the correlation between each pair of characteristics in the dataset
- A box-and-whisker plot illustrates the distribution of data for easy comparison between variables
- A distribution plot visualizes how each characteristic is distributed within the dataset
Note: Seaborn is a useful data visualization library for Python.
Deeper Knowledge on Exploratory Data Analysis (EDA)
PySpark Recipe: Select rows where any column contains a null value
A PySpark recipe to select rows with a null value in any column
Broader Topics Related to Exploratory Data Analysis (EDA)
The transformation of data to information
Transforming "raw" data into a more easily analyzed form through normalization and format standardization