How to Get the Time From a Timestamp Column in PySpark DataFrame
How can we extract the time from a timestamp column in a PySpark DataFrame?
Suppose we have a DataFrame df with the column datetime, which is of type timestamp.
Column of type timestamp
We might have cast this column to type timestamp using cast().
from pyspark.sql.functions import col
df = df.withColumn("datetime", col("datetime").cast("timestamp"))
We could also have used to_timestamp().
from pyspark.sql.functions import to_timestamp
# The optional second argument is a datetime pattern string, not a data type.
df = df.withColumn("datetime", to_timestamp("datetime"))
Either way, we have a timestamp column called datetime.
Get the time using date_format()
We can extract the time into a new column using date_format(), specifying the desired time format as a pattern string in the second argument.
from pyspark.sql.functions import date_format
df = df.withColumn("time", date_format('datetime', 'HH:mm:ss'))
This would yield a DataFrame that looks like this.
+-------------------+--------+
|           datetime|    time|
+-------------------+--------+
|2022-01-09T01:00:00|01:00:00|
|2022-01-09T06:00:00|06:00:00|
|2022-01-09T20:00:00|20:00:00|
+-------------------+--------+
Read more about date_format() in the PySpark documentation.