PySpark: How to select rows where any column contains a null value

When performing exploratory data analysis in PySpark, it is often useful to find rows that contain nulls in any column. This recipe filters a dataframe down to the rows in which at least one column is null. It works by generating an isNull() condition for each column and combining those conditions with OR (|) via reduce.

from pyspark.sql.functions import col
from functools import reduce

def select_rows_with_nulls(from_df):
	# Build an isNull() condition for every column, then OR them together
	# so a row is kept when any of its columns is null.
	return from_df.where(
		reduce(
			lambda col1, col2: col1 | col2,
			[col(col_name).isNull() for col_name in from_df.columns]
		)
	)

Example usage

In the following example, the dataframe will only contain rows in which at least one column contains a null value.

df = select_rows_with_nulls(from_df=df)
