What is the difference between sort() and orderBy() in Spark?



SORT BY and ORDER BY are different in Spark SQL

The SORT BY clause returns the result rows sorted within each partition in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered.

The SORT BY clause can be found in the Spark SQL documentation.

The ORDER BY clause returns the result rows sorted in the user-specified order. Unlike the SORT BY clause, it guarantees a total order in the output.

The ORDER BY clause can be found in the Spark SQL documentation.
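
To make the distinction concrete, here is a minimal pure-Python sketch that does not use Spark at all. It models partitions as plain lists, and the data and the helper names sort_by and order_by are made up for illustration. It is a simplification: real Spark keeps the ORDER BY result distributed across partitions, whereas this model flattens it into one list.

```python
from itertools import chain

# Hypothetical data: three "partitions" modelled as plain Python lists.
partitions = [[7, 2, 9], [4, 1, 8], [6, 3, 5]]


def sort_by(parts):
    """Model of SORT BY: sort each partition independently."""
    return [sorted(p) for p in parts]


def order_by(parts):
    """Model of ORDER BY: one total order across all partitions."""
    return sorted(chain.from_iterable(parts))


sorted_within = sort_by(partitions)
# Reading the partitions back one after another gives a partially ordered result.
flattened = list(chain.from_iterable(sorted_within))
totally_ordered = order_by(partitions)

print(sorted_within)    # [[2, 7, 9], [1, 4, 8], [3, 5, 6]]
print(flattened)        # [2, 7, 9, 1, 4, 8, 3, 5, 6]
print(totally_ordered)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each partition is sorted on its own in the SORT BY model, but the concatenated result is not globally sorted. Only the ORDER BY model yields a total order.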

sort() and orderBy() are the same in the DataFrame API

So, if SORT BY and ORDER BY are different in Spark SQL, why are sort() and orderBy() the same in the Spark DataFrame API?

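Let’s first look at how the two methods are defined in the languages supported by Spark. In the Scala Dataset API, orderBy() is documented as an alias of sort(), and in PySpark orderBy is likewise defined as an alias of sort.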

sort() and orderBy() both produce a total ordering of the dataset, like ORDER BY.

sortWithinPartitions() sorts the rows within each partition only, like SORT BY.