How to Convert a DataFrame Column Type from String to Timestamp in PySpark


How can we convert a column type from string to timestamp in a PySpark DataFrame?

Suppose we have a DataFrame df with column date of type string.

This column might have strings like this:

2022-01-04 10:41:05

Or maybe something funky like this:

2022_01_04 10_41_05

Let’s say we want to cast either of these columns into type timestamp.

Luckily, Column provides a cast() method to convert columns into a specified data type.

Cast standard timestamp formats

If our timestamp is standard (i.e. follows the yyyy-MM-dd HH:mm:ss.SSSS format), we can use either cast() or to_timestamp() to perform the cast.

Let’s say we wanted to cast the string 2022-01-04 10:41:05.

Cast using cast()

Here’s how we can cast using cast().

from pyspark.sql.types import TimestampType
df = df.withColumn("date", df["date"].cast(TimestampType()))

We can also use col() with simple strings.

from pyspark.sql.functions import col
df = df.withColumn("date", col("date").cast("timestamp"))

Cast using to_timestamp()

Here’s how we can cast using to_timestamp().

from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import TimestampType
df = df.withColumn("date", to_timestamp("date", TimestampType())) 

Keep in mind that both of these methods require the timestamp to follow this yyyy-MM-dd HH:mm:ss.SSSS format.

Read more on to_timestamp() in the PySpark documentation.

Cast abnormal timestamp formats

If we want to cast an abnormally formatted string into a timestamp, we’ll have to specify the format in to_timestamp().

Let’s say we wanted to cast the string 2022_01_04 10_41_05.

Cast using to_timestamp()

If we’re running Spark 2.2 or higher, we can cast easily with to_timestamp().

from pyspark.sql.functions import to_timestamp
df = df.withColumn("date", to_timestamp("date", "yyyy_MM_dd HH_mm_ss"))

Cast using unix_timestamp()

If we’re running < Spark 2.2, we can cast easily with unix_timestamp().

from pyspark.sql.functions import unix_timestamp
df = df.withColumn("date", unix_timestamp("date", "yyyy_MM_dd HH_mm_ss"))

Quick caveat: unix_timestamp() only has second-precision, as opposed to millisecond-precision. If we’re working with a standard format, it’s desired to just use the cast() method above, which can handle milliseconds.