How to Change a Column Type of a DataFrame in PySpark
How can we change the column type of a DataFrame in PySpark?
Suppose we have a DataFrame df with a column num of type string. Let's say we want to cast this column to type double.
Luckily, Column provides a cast() method to convert a column to a specified data type.
Cast using cast() and the singleton DataType
We can use the PySpark DataTypes to cast a column type.
from pyspark.sql.types import DoubleType
df = df.withColumn("num", df["num"].cast(DoubleType()))
# OR
df = df.withColumn("num", df.num.cast(DoubleType()))
We can also use the col() function to perform the cast.
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
df = df.withColumn("num", col("num").cast(DoubleType()))
Cast using cast() and simple strings
We can also use simple strings.
df = df.withColumn("num", df["num"].cast("double"))
# OR
df = df.withColumn("num", df.num.cast("double"))
Get simple string from DataType
Here is a list of DataTypes and their corresponding simple strings.
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp
The simple string for any DataType can be obtained using getattr() and simpleString(), like so:
from pyspark.sql import types
simpleString = getattr(types, 'BinaryType')().simpleString()
# OR
from pyspark.sql.types import BinaryType
simpleString = BinaryType().simpleString()
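As a quick illustration, the mapping listed above can be reproduced with a short loop (a sketch; the type names come from the list above, and each type is instantiated with its defaults):
from pyspark.sql import types
for name in ["BinaryType", "BooleanType", "ByteType", "DateType",
             "DecimalType", "DoubleType", "FloatType", "IntegerType",
             "LongType", "ShortType", "StringType", "TimestampType"]:
    # Print each DataType's simple string, e.g. "IntegerType int"
    print(name, getattr(types, name)().simpleString())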
We can also write out simple strings for arrays and maps: array<int> and map<string,int>.
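These composite simple strings can also be obtained from the corresponding DataTypes themselves; a small sketch:
from pyspark.sql.types import ArrayType, IntegerType, MapType, StringType
# simpleString() works for composite types as well
print(ArrayType(IntegerType()).simpleString())              # array<int>
print(MapType(StringType(), IntegerType()).simpleString())  # map<string,int>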
Read more about cast() in the PySpark documentation.