How to Remove Duplicate Columns on Join in a Spark DataFrame
How can we perform a join between two Spark DataFrames without any duplicate columns?
Example scenario
Suppose we have two DataFrames, df1 and df2, that both have a column named col.
We want to join df1 and df2 over column col, so we might run a join like this:
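joined = df1.join(df2, df1.col == df2.col)
Because the join condition is a column expression, Spark keeps col from both DataFrames, so joined ends up with two columns named col. Any later reference to col by name is then ambiguous.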
Join DataFrames without duplicate columns
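We can pass the join column name as a string or a list of strings instead of a join expression. Spark then performs an equi-join on that column, which must exist in both DataFrames, and keeps only one copy of it in the result.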
joined = df1.join(df2, ["col"])
# OR
joined = df1.join(df2, "col")