PySpark: How to Conditionally replace a column's values

When cleansing data in PySpark, it is often useful to replace inconsistent values with a single consistent value. For example, "M" and "m" may both appear as values in a gender column. This recipe replaces values in a DataFrame column with a single value wherever a condition holds:

from pyspark.sql.functions import col, when

def replace_values(in_df, in_column_name, on_condition, with_value):
    """Replace values in a column with with_value wherever on_condition holds."""
    return in_df.withColumn(
        in_column_name,
        when(
            on_condition,
            with_value
        ).otherwise(
            col(in_column_name)
        )
    )

Example usage

In this example, we replace multiple possible values for gender with "unspecified".

from pyspark.sql.functions import col, lower, trim

df = replace_values(
    in_df=df,
    in_column_name="gender",
    on_condition=(
        (lower(trim(col("gender"))) == "unknown") |
        (lower(trim(col("gender"))) == "not set") |
        (lower(trim(col("gender"))) == "prefer not to say") |
        (col("gender").isNull())
    ),
    with_value="unspecified"
)
