How to Substract String Timestamps From Two Columns in PySpark
How can we subtract string timestamps from two columns in a PySpark DataFrame?
Suppose we have a DataFrame df
with the columns start
and end
, both of which are of type string
.
They might hold datetime strings similar to the ones below.
+-------------------+-------------------+
| start| end|
+-------------------+-------------------+
|2022-01-09 01:00:00|2022-01-09 01:01:01|
|2022-01-09 06:00:00|2022-01-09 06:21:04|
|2022-01-09 20:00:00|2022-01-09 20:50:20|
+-------------------+-------------------+
Convert to string
type to timestamp
type
If we’re running Spark 2.2 or higher, we can use to_timestamp()
.
from pyspark.sql.functions import to_timestamp
df = df.withColumn('start', to_timestamp('start', 'MM-dd-yyyy HH:mm:ss'))
If we’re running < Spark 2.2, we can use unix_timestamp()
.
from pyspark.sql.functions import unix_timestamp
df = df.withColumn('start', unix_timestamp('start', 'MM-dd-yyyy HH:mm:ss'))
Subtract timestamp columns
We can now subtract the two columns to find the time difference.
Let’s see what that would look like with to_timestamp()
and the default formatting (no second argument).
from pyspark.sql.functions import to_timestamp
delta = to_timestamp('end') - to_timestamp('start')
df = df.withColumn('Duration', delta)