Machine Learning (ML)

Machine Learning (ML) is an approach to artificial intelligence that combines statistics and data science to develop and apply algorithms that improve their output through experience without being explicitly programmed to do so. In other words, these algorithms can "learn" to detect patterns, make decisions, and predict outcomes.

Machine Learning Terminology

Data sampling: Systematic creation of smaller, representative samples of larger data sets
Feature: A variable that is highly relevant to the outcome variable
Feature selection: Automatic detection of the variables most relevant to the outcome variable
Imputation: Correction of corrupt and missing values through inference
Integer encoding: Assignment of an integer value to each categorical value, e.g. the values "red", "green", and "blue" could be assigned the integer values 1, 2, and 3, respectively
One-hot encoding: Assignment of a bit-mapped binary value to each of a set of categorical values, e.g. a "color" category with potential values of "red", "green", and "blue" could be mapped to the three-bit values 100, 010, and 001, respectively
Outcome variable: The value to be predicted by a machine learning model
Outlier: An observation significantly different from other observations of the same data
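
The two encodings defined above can be illustrated with a minimal sketch in plain Python. The function names `integer_encode` and `one_hot_encode` are hypothetical helpers written for this example, and assigning codes in order of first appearance is an assumption; real projects typically rely on a library encoder instead.

```python
def integer_encode(values):
    """Map each distinct category to an integer, starting at 1,
    in order of first appearance (an assumption made for this sketch)."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return codes


def one_hot_encode(values):
    """Map each distinct category to a bit string with exactly one bit set."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    width = len(categories)
    return {
        category: "".join("1" if i == position else "0" for i in range(width))
        for position, category in enumerate(categories)
    }


colors = ["red", "green", "blue", "green"]
integer_codes = integer_encode(colors)  # {'red': 1, 'green': 2, 'blue': 3}
one_hot_codes = one_hot_encode(colors)  # {'red': '100', 'green': '010', 'blue': '001'}
```

Integer encoding implies an ordering between categories that may not exist in the data, which is one reason one-hot encoding is often preferred for unordered categories such as colors.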

The Machine Learning Process

graph TD
  subgraph 1. Source the Data
    DB1[(Data)] --> Gather
    DB2[(Data)] --> Gather
    DB3[(Data)] --> Gather --> Raw[(Raw Data)]
  end
  subgraph 2. Wrangle the Data
    Raw --> Understand
    Understand -->|Work with SMEs| Summarize
    Understand --> Visualize
    Summarize --> Cleanse
    Visualize --> Cleanse -->|Imputation, Outlier Detection...| Cleansed[(Cleansed Data)]
    Cleansed --> Select -.->|Gather identified missing data| Gather
    Select --> Sample[Data Sampling]
    Select --> Features[Feature Selection]
    Sample --> Prepare
    Features --> Prepare
    Prepare --> Encode[Encode Categorical Data]
    Prepare --> Normalize
    Encode --> Data[(Prepared Data)]
    Normalize --> Data
  end
  subgraph 3. Model the Data
    Data --> Model
  end
  subgraph 4. Use the Model
    Model --> Use[Use the Model]
  end
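
The cleanse and normalize steps of the wrangling stage might look something like the following sketch, which uses only the Python standard library. The specific techniques are assumptions chosen for illustration, since the process does not prescribe them: median imputation, z-score outlier detection with a threshold of 2, and min-max normalization. The `raw_ages` data is made up.

```python
from statistics import mean, median, pstdev


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def min_max_normalize(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw_ages = [23, 25, None, 24, 26, 22, 95]  # made-up data with one gap
imputed = impute_missing(raw_ages)          # the gap becomes the median, 24.5
outliers = find_outliers(imputed)           # [95]
normalized = min_max_normalize(imputed)     # 22 maps to 0.0, 95 maps to 1.0
```

Whether an outlier such as 95 should be removed, capped, or kept is a judgment call that usually benefits from the subject matter expert input shown in the "Understand" step.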

Machine learning model evaluation

Typically, when a machine learning model is trained, some portion of the available data is withheld for use in model evaluation. The trained model is then used to predict the outcomes for the withheld data, and the predictions are compared to the actual values to derive an accuracy rate, which is the proportion of correct predictions, and an error rate, which is the proportion of "bad" predictions made by the model.
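
As a rough sketch of this holdout evaluation, the example below withholds a quarter of a made-up data set, "trains" a deliberately trivial stand-in model, and scores it on the withheld rows. Everything here is an assumption for illustration: the helper names, the 25% holdout fraction, and the majority-class "model", which simply predicts the most common training label.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and withhold a fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_majority_model(train_rows):
    """Hypothetical stand-in model: always predicts the most common label."""
    most_common_label = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: most_common_label


def accuracy_and_error(actual, predicted):
    """Return (accuracy rate, error rate) as proportions of all predictions."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / len(actual)
    return accuracy, 1 - accuracy


# Made-up rows of (feature, label) pairs.
rows = [(i, "like" if i % 3 else "dislike") for i in range(20)]
train_rows, test_rows = train_test_split(rows)

model = train_majority_model(train_rows)
actual = [label for _, label in test_rows]
predicted = [model(features) for features, _ in test_rows]
accuracy, error_rate = accuracy_and_error(actual, predicted)
```

Because the two rates are proportions of the same set of predictions, they always sum to 1.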

Accuracy and error rates are useful; however, they treat all misclassifications as being equally bad. A confusion matrix tabulates predictions by actual class and predicted class, showing which kinds of misclassification occur and providing more detail on model accuracy.

For example, suppose a classification model predicts whether a user will "like" or "dislike" a post on social media, and it correctly predicts the user's response 60% of the time. The model therefore has a 60% accuracy rate and a 40% error rate. The confusion matrix for this model might look something like the following table, illustrating that the model performs better for predicting "like" classes than "dislike" classes.

Predicted class
Actual classLike31
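
A confusion matrix of this kind can be computed by counting (actual, predicted) pairs, as in the sketch below. The ten labels are invented for illustration and are not the values from the table above; they are chosen only so that the accuracy comes out at 60% and the model does better on "like" than on "dislike". The `confusion_matrix` helper is hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Count predictions as matrix[actual_class][predicted_class]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


labels = ["like", "dislike"]

# Invented evaluation data: five actual likes, five actual dislikes.
actual = ["like"] * 5 + ["dislike"] * 5
predicted = [
    "like", "like", "like", "like", "dislike",        # 4 of 5 likes correct
    "like", "like", "like", "dislike", "dislike",     # 2 of 5 dislikes correct
]

matrix = confusion_matrix(actual, predicted, labels)
# {'like': {'like': 4, 'dislike': 1}, 'dislike': {'like': 3, 'dislike': 2}}

# Correct predictions sit on the diagonal of the matrix.
accuracy = sum(matrix[label][label] for label in labels) / len(actual)  # 0.6
```

Reading across a row shows how one actual class was predicted: here 4 of 5 actual likes are predicted correctly, but only 2 of 5 actual dislikes, a difference that the single 60% accuracy figure hides.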

Machine learning resources

Deeper Knowledge on Machine Learning (ML)

Classification Learning

A type of machine learning that classifies entities based on their characteristics

Data Mining

A guide to finding patterns and relationships in data

Data Wrangling

Transforming "raw" data into a more easily analyzed form through normalization and format standardization

Broader Topics Related to Machine Learning (ML)

Artificial Intelligence (AI)

The mimicking of human cognitive functions and behaviors by machines

Data Science

The scientific method applied to data analysis
