PySpark Recipes

PySpark is the open-source Python API for Apache Spark. Here are a few useful recipes for working with PySpark DataFrames.

Remove columns that contain only null values

Datasets occasionally contain columns that are useless because they only contain null values. The following drop_null_columns function will remove all such columns:

import pyspark.sql.functions as sqlf

def drop_null_columns(from_df):
    """
    This function drops columns that only contain null values.
    :param from_df: A PySpark DataFrame
    """

    # Count the null values in each column with a single aggregation.
    null_counts = from_df.select(
        [
            sqlf.count(
                sqlf.when(sqlf.col(column).isNull(), column)
            ).alias(column)
            for column in from_df.columns
        ]
    ).collect()[0].asDict()

    # A column is entirely null when its null count equals the row count.
    row_count = from_df.count()
    cols_to_drop = [
        col_name
        for col_name, col_val in null_counts.items()
        if col_val == row_count
    ]

    return from_df.drop(*cols_to_drop)

Remove columns where every row contains the same value

Datasets occasionally contain columns that are useless because they only contain one value across all rows. The following drop_mono_columns function will remove all such columns (including columns containing all nulls):

import pyspark.sql.functions as sqlf

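def drop_mono_columns(from_df):
    """
    This function drops columns that contain the same value in every row.
    :param from_df: A non-empty PySpark DataFrame
    """

    # Use the first row as the reference value for each column.
    first_row = from_df.limit(1).collect()[0]

    # Count the rows that differ from the reference value. eqNullSafe treats
    # null as an ordinary value, so a null reference or null rows compare correctly.
    candidates = from_df.select(
        [
            sqlf.count(
                sqlf.when(~sqlf.col(column).eqNullSafe(first_row[column]), column)
            ).alias(column)
            for column in from_df.columns
        ]
    ).collect()[0].asDict()

    mono_cols = [key for key, value in candidates.items() if value == 0]

    print(f"Dropping columns: {mono_cols}")

    if len(mono_cols) > 0:
        return from_df.drop(*mono_cols)
    else:
        return from_df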
