PySpark: How to drop single-value columns

When data cleansing in PySpark, it can be useful to drop columns that only contain a single value; for example, columns with a single value are not typically useful when training a machine learning model. It works by retrieving the first row from the dataframe, then counting the number of rows containing columns with the same value as the corresponding column in the first row. Columns for which the count is equal to the count of the entire dataframe contain only a single value and may be dropped.

def drop_mono_columns(from_df):
    first_row = from_df.limit(1).collect()[0]
    candidates =[
			when(col(column) != first_row[column], column)
		for column in from_df.columns

    mono_cols = [key for key, value in candidates.items() if value == 0]

    return from_df.drop(*mono_cols) if len(mono_cols) > 0 else from_df

Drop columns containing only null values

The code above can be modified slightly so that only columns containing all null values are dropped. The following drop_null_columns function will remove only columns in which all values are null:

from pyspark.sql.functions import col, count, when

def drop_null_columns(from_df):
    This function drops columns that only contain null values.
    :param from_df: A PySpark DataFrame

    null_counts =
				when(col(column).isNull(), column)
            for column in from_df.columns

	row_count = from_df.count()
    cols_to_drop = [
		for col_name, col_val in null_counts.items() 
        if col_val == row_count

    return from_df.drop(*col_to_drop)

Broader Topics Related to PySpark Recipe: Drop single-value columns

PySpark Recipes

Quick and easy to copy recipes for PySpark

PySpark Recipe: Drop single-value columns Knowledge Graph