How to Sort a DataFrame in Descending Order in PySpark
How can we sort a DataFrame in descending order based on a particular column in PySpark?
Suppose we have a DataFrame df with the column col.
We can achieve this with either sort() or orderBy().
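If you'd like to follow along, a minimal setup might look like this (the sample values are made up purely for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Made-up sample values, just so the examples have something to sort
df = spark.createDataFrame([(3,), (1,), (2,)], ['col'])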
Sort using sort() or orderBy()
We can use sort() with col() or desc() to sort in descending order. Note that every example below can also be done using orderBy() instead of sort().
Sort with additional imports
We can sort with col().
from pyspark.sql.functions import col
df.sort(col('col').desc())
We can also sort with desc().
from pyspark.sql.functions import desc
df.sort(desc('col'))
Both are valid options, but let’s try to avoid the extra imports.
Sort without additional imports
df.sort(df.col.desc())
# OR
df.sort('col', ascending=False)
Remember that all of the examples above can be done using orderBy() instead of sort().
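For example, the two import-free lines above could equally be written with orderBy(), which accepts the same arguments as sort():
df.orderBy(df.col.desc())
# OR
df.orderBy('col', ascending=False)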
Sort multiple columns
Suppose our DataFrame df had two columns instead: col1 and col2.
Let’s sort based on col2 first, then col1, both in descending order.
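As a quick sketch (reusing the spark session from earlier, with made-up values), such a DataFrame could be created like this:
df = spark.createDataFrame(
    [(1, 10), (2, 30), (3, 20)],  # made-up example rows
    ['col1', 'col2']
)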
We’ll see the same code with both sort() and orderBy().
from pyspark.sql.functions import col
df.sort(col("col2").desc(), col("col1").desc())
df.orderBy(col("col2").desc(), col("col1").desc())
Let’s try without the extra imports.
df.sort(['col2', 'col1'], ascending=[False, False])
df.orderBy(['col2', 'col1'], ascending=[False, False])
Note: sort() and orderBy() both perform a total ordering of the dataset in the Spark DataFrame API. sort() does not perform partition-wise ordering; sortWithinPartitions() does.
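For completeness, a partition-wise descending sort might look like this sketch:
from pyspark.sql.functions import desc

# Orders rows only within each partition; no global ordering is guaranteed
df.sortWithinPartitions(desc('col'))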