How to Substract String Timestamps From Two Columns in PySpark


How can we subtract string timestamps from two columns in a PySpark DataFrame?

Suppose we have a DataFrame df with the columns start and end, both of which are of type string.

They might hold datetime strings similar to the ones below.

+-------------------+-------------------+
|              start|                end|
+-------------------+-------------------+
|2022-01-09 01:00:00|2022-01-09 01:01:01|
|2022-01-09 06:00:00|2022-01-09 06:21:04|
|2022-01-09 20:00:00|2022-01-09 20:50:20|
+-------------------+-------------------+

Convert to string type to timestamp type

If we’re running Spark 2.2 or higher, we can use to_timestamp().

from pyspark.sql.functions import to_timestamp
df = df.withColumn('start', to_timestamp('start', 'MM-dd-yyyy HH:mm:ss'))

If we’re running < Spark 2.2, we can use unix_timestamp().

from pyspark.sql.functions import unix_timestamp
df = df.withColumn('start', unix_timestamp('start', 'MM-dd-yyyy HH:mm:ss'))

Subtract timestamp columns

We can now subtract the two columns to find the time difference.

Let’s see what that would look like with to_timestamp() and the default formatting (no second argument).

from pyspark.sql.functions import to_timestamp
delta = to_timestamp('end') - to_timestamp('start')
df = df.withColumn('Duration', delta)