The process for data wrangling is:
- Understand: Define the meaning of, and relationships between, each field. Summary statistics, data visualizations, and guidance from subject-matter experts (SMEs) are used to scrutinize the data.
- Cleanse: Detect and address corrupt and missing data. Outlier detection and imputation are used to cleanse the data.
- Select: Remove unneeded data from the dataset, and gather missing data that cannot be reliably inferred. Data sampling is used to systematically create smaller representative samples of larger datasets, and feature selection is used to automatically identify the variables most relevant to the outcome variable.
- Prepare: Standardize and normalize the data into a consistent structure and format. Integer encoding and one-hot encoding are used to convert categorical data to numerical data to make it easier for a machine learning model to process.
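As a sketch of the Prepare step, the following example shows integer encoding and one-hot encoding using only the Python standard library (the `colors` column is a hypothetical example, not from the source):

```python
# Hypothetical categorical column to encode.
colors = ["red", "green", "blue", "green"]

# Integer encoding: map each category to a small integer.
categories = sorted(set(colors))                 # ['blue', 'green', 'red']
to_int = {c: i for i, c in enumerate(categories)}
integer_encoded = [to_int[c] for c in colors]    # [2, 1, 0, 1]

# One-hot encoding: one binary indicator per category, so no
# artificial ordering is implied between categories.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
# "red" -> [0, 0, 1], "green" -> [0, 1, 0], "blue" -> [1, 0, 0]
```

One-hot encoding is usually preferred when the categories have no natural order, since integer codes can mislead a model into treating "red" as greater than "green".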
The following steps are a simple example using standard deviation, but the interquartile range (IQR) could be used just as easily:
- Calculate the mean and standard deviation (alternatively, interquartile range) of the data collection
- Set a cutoff (c) of three standard deviations (c = 3σ)
- Set a lower bound (L) of the mean minus the cutoff (L = μ − c) and an upper bound (U) of the mean plus the cutoff (U = μ + c)
All data points less than the lower-bound or greater than the upper-bound can be considered outliers.
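The steps above can be sketched with the Python standard library; the `data` values are a hypothetical sample with one corrupt reading:

```python
import statistics

# Hypothetical readings: mostly near 10, with one corrupt value of 100.
data = [9, 10, 11] * 6 + [10, 100]

# Step 1: mean and (sample) standard deviation of the collection.
mean = statistics.mean(data)
std = statistics.stdev(data)

# Step 2: cutoff of three standard deviations.
cutoff = 3 * std

# Step 3: lower and upper bounds.
lower = mean - cutoff
upper = mean + cutoff

# Points outside the bounds are flagged as outliers.
outliers = [x for x in data if x < lower or x > upper]
cleaned = [x for x in data if lower <= x <= upper]
```

Note that a single extreme value inflates the standard deviation itself, which is one reason the IQR-based alternative is often more robust for small samples.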
The basic options for imputation are to do nothing, to remove records with missing or corrupt values, to replace those values (usually with the mean or mode), or some combination of these options. It's generally best to test each option and compare the outcomes to determine the best approach.
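Two of these options can be sketched as follows; the `values` column (with `None` marking missing entries) is a hypothetical example:

```python
import statistics

# Hypothetical numeric column; None marks a missing value.
values = [4.0, None, 6.0, 5.0, None, 5.5]

observed = [v for v in values if v is not None]

# Option: remove records with missing values.
dropped = observed  # [4.0, 6.0, 5.0, 5.5]

# Option: replace missing values with the mean of the observed values.
mean = statistics.mean(observed)
imputed = [v if v is not None else mean for v in values]
```

For categorical columns the mode (via `statistics.mode`) would take the place of the mean; comparing model performance under each strategy is what the text above recommends.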
Broader Topics Related to Data Wrangling
The scientific method applied to data analysis
Machine Learning (ML)
Machine learning terms, processes, and methods