How to Convert a DataFrame Column Type from String to Timestamp in PySpark
How can we convert a column type from string
to timestamp
in a PySpark DataFrame?
Suppose we have a DataFrame df
with column date
of type string
.
This column might have strings like this:
2022-01-04 10:41:05
Or maybe something funky like this:
2022_01_04 10_41_05
Let’s say we want to cast either of these columns into type timestamp
.
Luckily, Column
provides a cast()
method to convert columns into a specified data type.
Cast standard timestamp formats
If our timestamp is standard (i.e. follows the yyyy-MM-dd HH:mm:ss.SSSS
format), we can use either cast()
or to_timestamp()
to perform the cast.
Let’s say we wanted to cast the string 2022-01-04 10:41:05
.
Cast using cast()
Here’s how we can cast using cast()
.
from pyspark.sql.types import TimestampType
df = df.withColumn("date", df["date"].cast(TimestampType()))
We can also use col()
with simple strings.
from pyspark.sql.functions import col
df = df.withColumn("date", col("date").cast("timestamp"))
Cast using to_timestamp()
Here’s how we can cast using to_timestamp()
.
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import TimestampType
df = df.withColumn("date", to_timestamp("date", TimestampType()))
Keep in mind that both of these methods require the timestamp to follow this yyyy-MM-dd HH:mm:ss.SSSS
format.
Read more on
to_timestamp()
in the PySpark documentation.
Cast abnormal timestamp formats
If we want to cast an abnormally formatted string into a timestamp, we’ll have to specify the format in to_timestamp()
.
Let’s say we wanted to cast the string 2022_01_04 10_41_05
.
Cast using to_timestamp()
If we’re running Spark 2.2 or higher, we can cast easily with to_timestamp()
.
from pyspark.sql.functions import to_timestamp
df = df.withColumn("date", to_timestamp("date", "yyyy_MM_dd HH_mm_ss"))
Cast using unix_timestamp()
If we’re running < Spark 2.2, we can cast easily with unix_timestamp()
.
from pyspark.sql.functions import unix_timestamp
df = df.withColumn("date", unix_timestamp("date", "yyyy_MM_dd HH_mm_ss"))
Quick caveat:
unix_timestamp()
only has second-precision, as opposed to millisecond-precision. If we’re working with a standard format, it’s desired to just use thecast()
method above, which can handle milliseconds.