What is the difference between sort() and orderBy() in Spark?
What is the difference between sort() and orderBy() in the Spark API?
SORT BY and ORDER BY are different in Spark SQL
The SORT BY clause returns the result rows sorted within each partition, in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered.
The SORT BY clause can be found in the Spark SQL documentation.
The ORDER BY clause returns the result rows sorted in the user-specified order. Unlike the SORT BY clause, it guarantees a total order in the output.
The ORDER BY clause can be found in the Spark SQL documentation.
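To make the difference concrete, here is a minimal sketch in plain Python. It is not Spark code: each inner list stands in for one partition, and the helper names sort_by, order_by and flatten are hypothetical, chosen only for this illustration.

```python
# Plain-Python model of the two clauses (not Spark itself).
# Each inner list stands in for one partition of the data.
partitions = [[7, 2, 9], [4, 8, 1], [6, 3, 5]]


def sort_by(parts):
    """Model of SORT BY: sort the rows inside each partition independently."""
    return [sorted(p) for p in parts]


def order_by(parts):
    """Model of ORDER BY: produce one total order across all rows."""
    return sorted(row for p in parts for row in p)


def flatten(parts):
    """Read the partitions back one after another, as a result set would be."""
    return [row for p in parts for row in p]


within = sort_by(partitions)
print(within)                # [[2, 7, 9], [1, 4, 8], [3, 5, 6]]
print(flatten(within))       # [2, 7, 9, 1, 4, 8, 3, 5, 6]  (only partially ordered)
print(order_by(partitions))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]  (total order)
```

Each partition is sorted on its own in the first case, but reading the partitions back to back does not give a globally sorted result. Only the second case guarantees that.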
sort() and orderBy() are the same in the DataFrame API
So, if SORT BY and ORDER BY are different in Spark SQL, why are sort() and orderBy() the same in the Spark DataFrame API?
Let’s first look at some languages supported by Spark.
- In Python, orderBy() is an alias of sort(), as seen in the PySpark source.
- In Scala, orderBy() is an alias of sort(), as seen in the Spark Scala source.
- In Java, orderBy() is an alias of sort(), as seen in the Spark Java documentation.
sort() and orderBy() both perform a total ordering of the whole dataset, like ORDER BY.
sortWithinPartitions() performs partition-wise ordering, like SORT BY.