How to Get Distinct Combinations of Multiple Columns in a PySpark DataFrame
How can we get all unique combinations of multiple columns in a PySpark DataFrame?
Suppose we have a DataFrame df with columns col1 and col2. We can easily return all distinct values for a single column using distinct().
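For illustration, assume a small DataFrame like the following. The column values here are hypothetical; the snippets below only rely on the column names col1 and col2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data containing duplicate rows and duplicate pairs
df = spark.createDataFrame(
    [('a', 1), ('a', 1), ('a', 2), ('b', 2)],
    ['col1', 'col2'],
)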
# Returns a list of Row objects, one per distinct value of col1
df.select('col1').distinct().collect()
# OR: extract the plain values from each Row
df.select('col1').distinct().rdd.map(lambda r: r[0]).collect()
How can we get only distinct pairs of values in these two columns?
Get distinct pairs
We can simply pass a second column name to select() before calling distinct().
# Returns a list of Row objects, one per distinct (col1, col2) pair
df.select('col1', 'col2').distinct().collect()
# OR: convert each Row into a plain (col1, col2) tuple
df.select('col1', 'col2').distinct().rdd.map(lambda r: (r[0], r[1])).collect()
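If we want to keep whole rows while deduplicating on a subset of columns, rather than returning just the pairs, dropDuplicates() accepts a list of column names. Note that it keeps an arbitrary row per pair, so the values in the remaining columns are not deterministic.
# Keep one full row per distinct (col1, col2) pair
df.dropDuplicates(['col1', 'col2']).collect()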
Get distinct combinations for all columns
We can also get the unique combinations of all columns in the DataFrame using the asterisk *.
# Returns a list of Row objects, one per distinct row
df.select('*').distinct().collect()
# OR: convert each Row into a plain tuple of all column values
df.select('*').distinct().rdd.map(lambda r: tuple(r)).collect()
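Since selecting every column and then deduplicating is the same as deduplicating the whole DataFrame, we can skip select('*') entirely; calling dropDuplicates() with no arguments is equivalent as well.
# Equivalent shortcuts for distinct rows across all columns
df.distinct().collect()
df.dropDuplicates().collect()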