How to Convert a DataFrame Column Type from String to Timestamp in PySpark
How can we convert a column type from string to timestamp in a PySpark DataFrame?
Suppose we have a DataFrame df with column date of type string.
This column might have strings like this:
2022-01-04 10:41:05
Or maybe something funky like this:
2022_01_04 10_41_05
Let’s say we want to cast either of these columns into type timestamp.
Luckily, Column provides a cast() method to convert columns into a specified data type.
Cast standard timestamp formats
If our timestamp is standard (i.e. follows the yyyy-MM-dd HH:mm:ss.SSSS format), we can use either cast() or to_timestamp() to perform the cast.
Let’s say we wanted to cast the string 2022-01-04 10:41:05.
Cast using cast()
Here’s how we can cast using cast().
from pyspark.sql.types import TimestampType
df = df.withColumn("date", df["date"].cast(TimestampType()))
We can also use col() with simple strings.
from pyspark.sql.functions import col
df = df.withColumn("date", col("date").cast("timestamp"))
Cast using to_timestamp()
Here’s how we can cast using to_timestamp().
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import TimestampType
df = df.withColumn("date", to_timestamp("date", TimestampType()))
Keep in mind that both of these methods require the timestamp to follow this yyyy-MM-dd HH:mm:ss.SSSS format.
Read more on
to_timestamp()in the PySpark documentation.
Cast abnormal timestamp formats
If we want to cast an abnormally formatted string into a timestamp, we’ll have to specify the format in to_timestamp().
Let’s say we wanted to cast the string 2022_01_04 10_41_05.
Cast using to_timestamp()
If we’re running Spark 2.2 or higher, we can cast easily with to_timestamp().
from pyspark.sql.functions import to_timestamp
df = df.withColumn("date", to_timestamp("date", "yyyy_MM_dd HH_mm_ss"))
Cast using unix_timestamp()
If we’re running < Spark 2.2, we can cast easily with unix_timestamp().
from pyspark.sql.functions import unix_timestamp
df = df.withColumn("date", unix_timestamp("date", "yyyy_MM_dd HH_mm_ss"))
Quick caveat:
unix_timestamp()only has second-precision, as opposed to millisecond-precision. If we’re working with a standard format, it’s desired to just use thecast()method above, which can handle milliseconds.